Breast Cancer Computer-Aided Detection System based on Simple Statistical Features and SVM Classification

Computer-Aided Detection (CADe) systems are becoming increasingly useful in supporting physicians in the early detection of breast cancer. In this paper, a CADe system that detects abnormal clusters in mammographic images is implemented using different classifiers and features. The CADe system uses a Support Vector Machine (SVM) and K-Nearest Neighbor (KNN) as classifiers. Using the mammographic database of the Mammographic Image Analysis Society (MIAS) for training and testing, the performance of the two types of classifiers is compared in terms of sensitivity, specificity, and accuracy. The values obtained for these parameters show that the CADe system is suitable as a secondary screening method for detecting abnormal clusters, given the Region of Interest (ROI). The best classifier was the SVM, which achieved 96% accuracy, 92% sensitivity, and 100% specificity.


I. INTRODUCTION
Breast cancer is a disease that occurs when cells in the female breast grow out of control. The type of breast cancer depends on morphology and proliferation. A female human breast is made up of three main parts: lobules, ducts, and connective tissue. The lobules are the glands that produce milk. The ducts are the tubes that carry milk to the nipple. The connective tissue (which consists of fibrous and fatty tissue) surrounds and holds everything together [1]. Most breast cancers begin in the ducts or lobules.
In later stages, breast cancer can spread outside the breast through blood vessels and lymph vessels. When breast cancer spreads to other parts of the body, it is said to have metastasized [1]. Breast cancer is among the most common and lethal cancers in women, with about 2.09 million cases reported globally in 2018 according to the World Health Organization (WHO) [2]. Recently, survival rates have increased owing to greater awareness of the disease through social media and to the wider availability and advancement of healthcare technology, especially mammography and other diagnostic imaging techniques [3]. Mammography is commonly used as a diagnostic imaging technique for detecting breast cancer because of its availability, shorter imaging time, and lower cost compared with other methods such as Magnetic Resonance Imaging (MRI). Ultrasound imaging, on the other hand, is lower in cost but worse in terms of reproducible mapping to physical location.
In mammography, some factors, such as the appearance of microcalcifications, can lead physicians to wrong decisions. Furthermore, biopsy, which is used to support the surgery decision, is painful for patients. Hence, the use of CADe systems may help confirm the decision without the need for biopsy. Our CADe system assumes that the ROI has already been identified by a radiologist, and it is intended to serve at least as a secondary diagnosis method to support the surgery decision.

II. LITERATURE REVIEW
Several previous studies have addressed CADe systems for breast cancer using mammography. They contributed preprocessing algorithms, new features that are more statistically significant or more relevant to the morphology of abnormal images, and classifiers that perform better when combined with a given set of features.
Arai et al. [4] split the database taken from the Japanese Society of Computer Aided Medical Imaging Technology into training and testing parts, with proportions of 74% and 26%, respectively. The authors used mostly statistical features, including mean, variance, maximum, coefficient of variation, and standard deviation, plus two additional features: the 7 Hu moments and the centroid. These features were extracted from each wavelet decomposition result: the approximation and the horizontal, vertical, and diagonal details. A Support Vector Machine (SVM) classifier was used and obtained a sensitivity of 90% and a specificity of 91.43%. This study relied on features obtained after image transformation, which may complicate the training of the classifier and is more computationally expensive.
Khaoula et al. [5] proposed a Computer-Aided Diagnosis system, using the Mini-MIAS database, to detect abnormal areas in digital mammograms. Only the dense breast category was used, and images were classified as abnormal (benign and malignant) or normal. An electromagnetism-like (EML) optimization algorithm, followed by an edge detection algorithm based on a Fuzzy Inference System (FIS), was used to identify suspicious structures. With an SVM classifier, this method achieved an accuracy of 86.36%. The features used in this study are computationally expensive, while the accuracy attained is comparatively low.
Pratiwi et al. [6] found that a Radial Basis Function Neural Network (RBFNN) is more accurate in classifying digital mammogram images, with a sensitivity of 97.22% and a specificity of 91.49% for normal versus abnormal classification (CADe). For classifying benign versus malignant lesions (Computer-Aided Diagnosis, or CADx), the RBFNN's sensitivity is 100% and its specificity is 89.47%. The authors used features from the Gray-Level Co-occurrence Matrix (GLCM) and suggested that other texture-based feature extraction methods, such as wavelet or curvelet, could be used in breast cancer classification to improve accuracy.
Setiawan et al. [7] studied the use of Laws' Texture Energy Measure (LAWS) features as descriptors for classifying mammogram images. In their experiments, LAWS features gave better accuracy than GLCM features when classifying mammogram images. The accuracy of benign versus malignant classification (CADx) was 78.21% with LAWS features, whereas with GLCM features the accuracy was below 55% for each degree. The authors used an Artificial Neural Network (ANN) as the classifier and suggested that the results could be improved by changing the architecture of the neural network model or the number of nodes in the hidden layer.
Saad et al. [8] developed an algorithm that uses Otsu's method for the detection of microcalcifications (MCs) and the automatic diagnosis of breast cancer. The enhancement evaluation parameters, namely the contrast improvement index (CII), peak signal-to-noise ratio (PSNR), and edge preservation index (EPI), indicate that the enhancement algorithm significantly improved the contrast of MCs against the background and hence improved their detection. The results also show that adaptive boosting (AdaBoost) classification is more sensitive and accurate than the ANN for the detection of both single and clustered MCs [14]. The algorithm was tested on the Digital Database for Screening Mammography (DDSM), MIAS, and a local database, and showed a high overall accuracy (98.68%) and a sensitivity of 80.15%.
Pavel et al. [9] proposed a breast cancer detection method that uses Local Binary Patterns (LBP) features for breast representation. The method was evaluated on a set created from the MIAS and DDSM databases and showed an accuracy close to 84% using an SVM classifier only. This study used only LBP features, which showed promising accuracy [13]. The overall performance of the classifier could be improved if the ROI were specified. Table I summarizes the previous studies on breast cancer detection using mammograms.

III. DATABASE
MIAS is an organization of U.K. research groups interested in the understanding of mammograms and in image processing and recognition [10]. The MIAS database consists of 322 images belonging to three classes: normal, benign, and malignant. There are 208 normal, 63 benign, and 51 malignant mammograms.
Detailed information about the MIAS database is provided in an introduction file, which gives seven information columns for each mammogram; for more information, refer to [11].
The dataset used in this study is a subset of the MIAS database comprising 72 normal (non-cancerous) images and 72 abnormal (cancerous) images. Of these, 96 images (48 normal and 48 abnormal) are used for training and 48 images (24 normal and 24 abnormal) are used for testing. Diagnostic details of the abnormal cases are shown in Table II. The software used in this study is MATLAB R2019b, network-licensed through the university system.
Table II (only one row is recoverable): Other, ill-defined masses; MISC; 12.

IV. METHODOLOGY
1) Preprocessing: A Region of Interest (ROI) of 32x32 pixels is cropped around the center of the suspicious area marked by the radiologist, for all images in the dataset. This reduces the computational load and concentrates feature computation on the ROI rather than on other details of the whole breast image [12]. During our study, we also tried using the full-size mammogram images (1024x1024), but the results were not significant.
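The ROI cropping step can be illustrated with a minimal Python sketch. It is not the MATLAB code used in the study: the function name crop_roi, the list-of-rows image representation, the assumption that the marked center is already given as (row, column) array indices, and the choice to shift the window inward near the image border are all assumptions made for illustration.

```python
def crop_roi(image, center_row, center_col, size=32):
    """Return a size x size window around the marked center.

    `image` is a list of rows of gray levels. If the window would cross
    the image border, it is shifted inward (an assumed policy; the paper
    does not say how border cases were handled).
    """
    half = size // 2
    n_rows, n_cols = len(image), len(image[0])
    top = min(max(center_row - half, 0), n_rows - size)
    left = min(max(center_col - half, 0), n_cols - size)
    return [row[left:left + size] for row in image[top:top + size]]


# Synthetic 64x64 image standing in for a mammogram (not MIAS data).
image = [[r * 64 + c for c in range(64)] for r in range(64)]
roi = crop_roi(image, center_row=40, center_col=10)
print(len(roi), len(roi[0]))
```

For a real 1024x1024 mammogram, the same call yields a 32x32 window, so every later feature is computed over 1024 pixels instead of roughly one million.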
2) Feature Extraction: Initially, we computed 94 features: 14 first-order statistical features and 80 texture features (64 histogram features and 16 GLCM features). Features were then eliminated manually after checking both the p-value of the t-test (significance level p < 0.05) and the classifiers' performance parameters, including accuracy, sensitivity, and specificity. After several rounds of trial and error, the final, most contributing features used in this study are first-order statistical ones: mean, median, mode, and quantiles (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9). These showed the best t-test significance along with the best classification performance.
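As an illustration of the final feature set, the sketch below computes the mean, median, mode, and the nine quantiles for a flattened ROI, together with a two-sample t statistic for screening a single feature. It is a sketch under stated assumptions, not the study's MATLAB implementation: the function names are hypothetical, Python's statistics.quantiles may interpolate differently from MATLAB's quantile function, and Welch's unequal-variance form of the t statistic is assumed because the paper does not state which variant was used. The p-value itself is not computed here.

```python
import statistics


def first_order_features(pixels):
    """Mean, median, mode and the quantiles 0.1 ... 0.9 of a flat list of gray levels."""
    deciles = statistics.quantiles(pixels, n=10)  # nine cut points
    features = {
        "mean": statistics.fmean(pixels),
        "median": statistics.median(pixels),
        "mode": statistics.mode(pixels),  # first mode if several values tie
    }
    for i, value in enumerate(deciles, start=1):
        features[f"q0.{i}"] = value
    return features


def welch_t(a, b):
    """Welch's t statistic for one feature measured in two groups (assumed variant)."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = (var_a / len(a) + var_b / len(b)) ** 0.5
    return (statistics.fmean(a) - statistics.fmean(b)) / standard_error


# Synthetic gray levels standing in for a flattened 32x32 ROI (not MIAS data).
roi_pixels = list(range(1, 100))
features = first_order_features(roi_pixels)
print(len(features), features["mean"], features["q0.9"])
```

In a screening loop of this kind, each feature would be computed for all normal and all abnormal training ROIs, and the t statistic (with its p-value) would indicate whether the two groups differ for that feature.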
3) Classifiers: The classifiers used in this study are shown in Table III. Fig. 1 illustrates the iterative steps used while designing the CADe system. Each block/step is explained in the text.

Table IV shows that the best classifier was SVM-Linear, with accuracy = 96% and sensitivity = 92%, followed by KNN-3 with an error of only 6%. Compared with previous studies, the results were satisfying, given that the features used were only simple first-order statistics.
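The nearest-neighbor classifier and the three evaluation measures can be sketched as follows. This is a hedged illustration only: it assumes that "KNN-3" means k = 3 with Euclidean distance and majority voting, the function names and toy feature vectors are hypothetical, and the linear SVM is not sketched because it would require an external library. The confusion counts passed to screening_metrics are not reported in the paper; they are the counts inferred from the reported rounded percentages and the 24 normal plus 24 abnormal test images.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training vectors closest to x (Euclidean distance)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def screening_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity and accuracy from confusion counts."""
    sensitivity = tp / (tp + fn)   # abnormal cases correctly flagged
    specificity = tn / (tn + fp)   # normal cases correctly cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Toy two-feature vectors (0 = normal, 1 = abnormal); not MIAS data.
train_X = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8], [3.0, 3.1], [2.9, 3.2], [3.1, 2.8]]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_X, train_y, [1.0, 1.0]))

# Inferred counts: 22 of 24 abnormal and 24 of 24 normal test images correct.
sens, spec, acc = screening_metrics(tp=22, fn=2, tn=24, fp=0)
print(round(sens * 100), round(spec * 100), round(acc * 100))
```

Under these inferred counts, 22/24, 24/24, and 46/48 round to the reported 92% sensitivity, 100% specificity, and 96% accuracy, and three errors out of 48 (6.25%) are consistent with the roughly 6% error reported for KNN-3.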

VI. CONCLUSION AND DISCUSSION
In this study, the final results compared favorably with previous studies that used SVM classification, even though only simple first-order statistical features were used here. On the other hand, with neural-network-based classifiers, previous studies showed that computationally more expensive features gave results comparable to those obtained here. Future studies can contribute by adding microcalcifications (MCs) to the dataset (MCs were excluded in this study) and by using more sophisticated classifiers, such as ANN and RBFNN.