An Ontological Model based on Machine Learning for Predicting Breast Cancer

Abstract: Breast cancer mostly affects women, but it can also affect men, albeit at a considerably lower rate. Because manual breast cancer diagnosis takes a long time, an automated diagnosis system should be built for early detection. Clinicians have recently made significant advances in the early identification and treatment of breast cancer, which helps decrease the associated mortality rate. Researchers, meanwhile, analyse large amounts of complex medical data using a combination of statistical and machine learning methodologies to assist clinicians in predicting breast cancer. Various machine learning approaches, including ontology-based machine learning methods, have recently played an essential role in medical science by enabling automated systems that can identify breast cancer. This study examines and evaluates the most popular machine learning algorithms alongside an ontological model based on machine learning. The classification methods investigated are Naive Bayes, Decision Tree, Logistic Regression, Support Vector Machine, Artificial Neural Network, Random Forest, and k-Nearest Neighbours. The dataset used has 683 instances and is available for download from the Kaggle website. The findings are assessed using performance measures derived from the confusion matrix: F-Measure, Accuracy, Precision, and Recall. According to the results, the ontology model outperformed all of the machine learning techniques.


I. INTRODUCTION
In 2020, 2.3 million women were diagnosed with breast cancer, and there were 685,000 deaths worldwide. As of the end of 2020, 7.8 million women alive had been diagnosed with breast cancer in the previous five years, making it the most prevalent cancer in the world. Breast cancer accounts for more disability-adjusted life years (DALYs) lost by women than any other kind of cancer worldwide. It affects women of any age after puberty in every country, with rates rising later in life. From the 1930s through the 1970s, there was minimal change in the breast cancer death rate. Survival began to improve in the 1980s in countries where early detection programmes were available in conjunction with various treatment options to eradicate invasive disease.
Machine learning (ML) is one of the most rapidly evolving areas of computer science, with a wide range of applications [1], [2]. It is the process of extracting usable information from large quantities of data [3]. Marketing, industry, medical diagnosis, and other scientific domains all make use of ML approaches. ML algorithms have been applied frequently to medical datasets and are well suited to medical data analysis. ML tasks take several forms, including classification, regression, and clustering, and the appropriate form depends on the problem being addressed. We focus on classification algorithms in our work because of their high accuracy and performance in assigning the instances of a dataset to predetermined categories and in predicting future events from that data. In the medical field, classification algorithms are often used, particularly in the diagnosis of illnesses such as breast cancer. Accordingly, commonly used classification methods such as Support Vector Machine (SVM), k-Nearest Neighbours (KNN), Artificial Neural Network (ANN), Naive Bayes (NB), Logistic Regression (LR), and Decision Tree (DT) are used to detect patients with breast cancer at an early stage.
Several studies have been conducted, and various machine learning models have been implemented, to identify and predict breast cancer diagnoses [4]. For example, the study in [5] sought to identify the most accurate machine learning approaches for predicting breast cancer. Various supervised machine learning algorithms, including RF, KNN, DT, AdaBoostM1, LR, and ANN [6], were used and their performance was compared. The authors of [7] compared linear discriminant analysis with support vector machines in terms of specificity, F-Measure, accuracy, and sensitivity to see which method is better for classifying breast cancer datasets. The results reveal that the support vector machine outperforms linear discriminant analysis, with an accuracy of 98.77 percent. The study in [8] applies a fully Bayesian approach to assess the predictive distribution of all classes using three datasets and three classifiers: naive Bayes (NB), Bayesian networks (BN), and tree augmented naive Bayes (TAN). The results showed that the BN method performed best, with an accuracy of 97 percent.
Breast cancer diagnosis and prediction have garnered a lot of attention in recent years, and numerous approaches have been taken to address this issue [9]-[12]. The present focus is on machine learning and the semantic web. The Wisconsin Breast Cancer Dataset was used in [13], whose authors study the dataset and assess the efficacy of several ML algorithms for predicting breast cancer. Several machine learning methods have been developed to differentiate benign and malignant tumors. To predict whether breast cancer is malignant or benign, the authors of [14] used machine learning and computer vision to extract features and construct an optimized model by varying hyper-parameter values. The analysis is carried out using a support vector machine, and several quality indicators are reported, with notable results. The goal of [15] is to evaluate the prediction accuracy of classification algorithms in terms of efficiency and effectiveness. The authors conduct a rigorous comparison of classification algorithms such as SVM, DT, NB, and RF in terms of prediction accuracy using WEKA and 10-fold cross-validation on the Wisconsin Diagnostic Breast Cancer dataset. In [16], two machine learning algorithms, Decision Tree Classifier and Logistic Regression, were implemented for breast cancer prediction on the "Breast Cancer Wisconsin (Diagnostic) Data Set," and their accuracies were compared; the Decision Tree Classifier was found to be the better-suited algorithm for prediction because of its prediction accuracy.
Recently, researchers have published a significant body of work using machine learning algorithms to diagnose breast cancer [17]-[20]. In the comparison study in [21], four ML algorithms were used: DT, SVM, KNN, and RF. The results show that the Support Vector Machine has the highest accuracy among them, 97%, for the classification of breast tumors in women. On the Wisconsin Breast Cancer dataset, the authors of [22] examined five supervised machine learning algorithms: LR, RF, KNN, ANN, and SVM. Metrics derived from the confusion matrix are used to assess performance. According to the results, the ANN had the highest accuracy, 98.57 percent.
Furthermore, ontologies have been among the most widely used techniques for managing, organizing, and extracting data over the last few decades. An ontology is a form of data representation that has been used effectively in a number of domains, particularly the medical domain. It is significant in computer science because of its ability to express concepts and their relationships across fields. In practice, no single ontology is sufficient to meet today's expanding healthcare demands, and ontologies must be combined with machine learning algorithms to facilitate data integration and analysis. The authors of [23] created and explored an ontology-based decision tree model able to predict diabetes; the work in [24] then compared the findings with numerous ML techniques and found that the ontology model outperforms all other classifiers.
In this research, we intend to compare seven prominent classification approaches with the Ontological Model using carefully chosen criteria obtained from the confusion matrix, such as F-Measure, Accuracy, Precision, and Recall. The rest of this paper is organized as follows: Section II describes the methodologies utilized in this comparison analysis. Section III summarizes the findings and discussion. Section IV concludes and discusses future work.

II. METHODS AND EVALUATION
The approaches and materials employed, including the experimental methodology, dataset description, machine learning algorithms, ontology model, and evaluation metrics, are described in this section. Fig. 1 depicts the process flowchart for this comparative study.

A. Data Preprocessing
The dataset used is "Breast Cancer Wisconsin - benign or malignant" from the Kaggle website. It consists of 683 instances and 10 features (9 attributes plus the target). A full description of all dataset attributes is provided in Table I. To build an effective machine learning classifier, one should generally start with data cleaning, feature normalization, feature transformation, and, where useful, the creation of new features from the dataset. The dataset contains 234 duplicated instances; after removing them, 449 instances remain, of which 213 represent benign cancer cells and 236 represent malignant cancer cells. To provide a fair comparison of the classification results, we did not use any feature selection or performance-boosting methods.
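The duplicate-removal step described above can be sketched as follows. This is a minimal illustration, assuming that "duplicated" means rows whose nine attribute values and class label are all identical; the toy records and the helper names `drop_duplicates` and `class_counts` are hypothetical and are not taken from the dataset or from the authors' tooling.

```python
from collections import Counter


def drop_duplicates(rows):
    """Return rows with exact duplicates removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def class_counts(rows, target_index=-1):
    """Count how many rows carry each class label (assumed to be the last column)."""
    return Counter(row[target_index] for row in rows)


# Hypothetical toy records: nine attribute values followed by the class label.
toy = [
    (5, 1, 1, 1, 2, 1, 3, 1, 1, "benign"),
    (5, 1, 1, 1, 2, 1, 3, 1, 1, "benign"),  # exact duplicate of the first row
    (8, 10, 10, 8, 7, 10, 9, 7, 1, "malignant"),
    (1, 1, 1, 1, 2, 1, 3, 1, 1, "benign"),
]

cleaned = drop_duplicates(toy)
counts = class_counts(cleaned)
print(len(toy), "->", len(cleaned), dict(counts))
```

Applied to the real data, the same check should be consistent with the counts reported above: 683 - 234 = 449 remaining instances, and 213 + 236 = 449.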

B. Machine Learning Algorithms
We used the Weka software for all machine learning algorithms to predict whether the cancer cells are benign or malignant. Weka comprises tools for data preprocessing, classification, regression, clustering, association rule mining, and visualization [25].
We used seven of the most widely used classifiers (Decision Tree, Random Forest, Logistic Regression, Artificial Neural Network, Naïve Bayes, Support Vector Machine, and k-Nearest Neighbours). In addition, to enrich the study, we employed two test options: 10-fold cross-validation and percentage split (50% for training, the remainder for testing).
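To make the two test options concrete, the sketch below generates index sets for a 50% percentage split and for 10-fold cross-validation over 449 instances. It is an illustration only: the function names, the fixed seed, and the plain (non-stratified) shuffling are assumptions, and Weka's own evaluation procedure may differ, for example by stratifying folds by class.

```python
import random


def percentage_split(n, train_fraction=0.5, seed=1):
    """Shuffle indices 0..n-1 and cut them into a training part and a test part."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = int(n * train_fraction)
    return indices[:cut], indices[cut:]


def k_fold_indices(n, k=10, seed=1):
    """Yield (train, test) index lists; every instance is tested exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


n = 449  # instances remaining after duplicate removal
train, test = percentage_split(n)
splits = list(k_fold_indices(n))
print(len(train), len(test), [len(te) for _, te in splits])
```

In the percentage split each model is trained once and tested once, whereas in 10-fold cross-validation it is trained ten times and the per-fold results are aggregated.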

C. Ontological Model
This section presents the technologies used to create the ontology and the approach used to build the ontology model with the help of rules extracted from the DT. The methodology is described in detail in [26], which we recommend for further information; only the main points are summarized here.

1) Ontology construction:
The ontology was built using the Protégé software, which is an open-source platform that provides a set of tools to a growing user community for constructing domain models and knowledge-based applications with ontologies. The ontology was created manually; the main classes are Diagnostic and Patient. The graphical representation of the ontology is shown in Fig. 2.

Table I (excerpt). Dataset attributes
4-adhesion: Marginal Adhesion. Adhesion loss is an indication of cancer.
5-epithelial: Single Epithelial Cell Size. Is connected to the previously mentioned uniformity. Significantly expanded epithelial cells may be cancerous cells.
6-bare_nuclei: Bare Nuclei. These are common in benign tumors.
7-bland_chromatin: Bland Chromatin. In benign cells, the nucleus has a homogeneous texture.
8-normal_nucleoli: Normal Nucleoli. In normal cells, the nucleolus is generally quite tiny, if at all detectable. The nucleoli grow more visible in cancer cells.

2) Data properties and instances:
The data properties used in the ontology are the same attributes presented in Table I, which are used to build the machine learning models. Fig. 3 illustrates the data properties. Cellfie, a Protégé plugin, is used to import the same dataset used in Weka.
3) Semantic Web Rule Language rules and Pellet reasoner:
Following the creation of classes, data properties, and instances in the ontology, we need to establish the SWRL reasoning rules. To achieve this, we extracted the rules generated by the DT algorithm and imported them into Protégé using the SWRLTab plugin. The rules collected from the DT algorithm are converted using the Java programming language, with each leaf of the tree extracted as a single SWRL rule. For instance:

[...]'^^xsd:decimal) ^ adhesion(?P, ?A) ^ swrlb:lessThanOrEqual(?A, '3'^^xsd:decimal) -> benign
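The leaf-to-rule conversion described above can be sketched as follows. The authors state that they performed this step in Java; the Python below is only an illustration of the idea. The toy tree, its thresholds, the variable naming scheme (`?V0`, `?V1`), and the rule head `diagnosis(?P, "...")` are all assumptions made for the example. Only the `Patient` class, the attribute names, and the `swrlb` comparison built-ins come from the text or from the SWRL built-in vocabulary.

```python
def leaf_rules(node, conditions=()):
    """Yield one (conditions, label) pair per leaf of a binary threshold tree."""
    if "label" in node:
        yield list(conditions), node["label"]
        return
    attr, thr = node["attribute"], node["threshold"]
    # Left branch: attribute <= threshold; right branch: attribute > threshold.
    yield from leaf_rules(node["left"], conditions + ((attr, "lessThanOrEqual", thr),))
    yield from leaf_rules(node["right"], conditions + ((attr, "greaterThan", thr),))


def to_swrl(conditions, label):
    """Render the path to one leaf as a single SWRL-style rule string."""
    atoms = ["Patient(?P)"]
    variables = {}
    for attr, builtin, thr in conditions:
        if attr not in variables:
            # Bind each data property to its own variable the first time it appears.
            variables[attr] = f"?V{len(variables)}"
            atoms.append(f"{attr}(?P, {variables[attr]})")
        atoms.append(f"swrlb:{builtin}({variables[attr]}, '{thr}'^^xsd:decimal)")
    # Hypothetical rule head; the head used by the authors is not shown in the text.
    return " ^ ".join(atoms) + f' -> diagnosis(?P, "{label}")'


# Hypothetical two-level tree, not the tree learned in the study.
tree = {
    "attribute": "bare_nuclei", "threshold": 2,
    "left": {
        "attribute": "adhesion", "threshold": 3,
        "left": {"label": "benign"},
        "right": {"label": "malignant"},
    },
    "right": {"label": "malignant"},
}

rules = [to_swrl(conds, label) for conds, label in leaf_rules(tree)]
for rule in rules:
    print(rule)
```

Because every root-to-leaf path becomes exactly one rule, the resulting rule set is mutually exclusive and covers every possible combination of attribute values, just as the tree does.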
To execute the SWRL rules and infer new ontology axioms, we used another Protégé plugin, Pellet [27], a reasoner whose capabilities include checking ontology consistency, computing the classification hierarchy, handling OWL and SWRL rules, explaining inferences, and answering SPARQL queries. It uses the ontology and the SWRL rules to run the inference and then determines whether the cancer cells are benign or malignant. The ontology classifier's results are reported in the next section.
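As a rough illustration of what rule execution decides for a single patient, the sketch below checks threshold rules against a patient's attribute values and returns the label of the rule that fires. It only mimics the numeric comparisons; it is not how Pellet works internally, and a real reasoner does far more (consistency checking, classification, explanation). The rule set, the thresholds, and the function name `infer` are hypothetical.

```python
import operator

# Python equivalents of the two comparison built-ins used in this sketch.
BUILTINS = {"lessThanOrEqual": operator.le, "greaterThan": operator.gt}

# Hypothetical rules: (list of (attribute, builtin, threshold), inferred label).
RULES = [
    ([("bare_nuclei", "lessThanOrEqual", 2), ("adhesion", "lessThanOrEqual", 3)], "benign"),
    ([("bare_nuclei", "lessThanOrEqual", 2), ("adhesion", "greaterThan", 3)], "malignant"),
    ([("bare_nuclei", "greaterThan", 2)], "malignant"),
]


def infer(patient, rules=RULES):
    """Return the label of the first rule whose conditions all hold, else None."""
    for conditions, label in rules:
        if all(BUILTINS[builtin](patient[attr], thr) for attr, builtin, thr in conditions):
            return label
    return None


print(infer({"bare_nuclei": 1, "adhesion": 2}))
print(infer({"bare_nuclei": 1, "adhesion": 5}))
print(infer({"bare_nuclei": 7, "adhesion": 1}))
```

Since each rule is a readable conjunction of thresholds, a domain expert can inspect why a given patient was labelled benign or malignant and adjust individual rules if needed.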

D. Evaluation
Performance measures such as ROC area, F-Measure, root mean squared error, Recall, Accuracy, root relative squared error, Precision, and the Kappa statistic are commonly employed to assess ML algorithms. We employed two test modes (percentage split and k-fold cross-validation) and used Recall, F-Measure, Accuracy, and Precision to analyze our experimental results, which are presented below and in Fig. 4. Furthermore, the same criteria are used to assess this comparison of the ML classifiers and the ontological model. Tables II and III and Fig. 5 illustrate the performance metrics of the ontology model.
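The four measures used in this comparison follow directly from the entries of a binary confusion matrix. The sketch below computes them for one class treated as positive; the counts are made up for illustration and are not results from this study. Weka may additionally report per-class values and weighted averages, which this sketch does not do.

```python
def metrics(tp, fp, fn, tn):
    """Compute Accuracy, Precision, Recall, and F-Measure from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-Measure is the harmonic mean of precision and recall.
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts: 90 true positives, 10 false positives,
# 5 false negatives, 95 true negatives.
m = metrics(tp=90, fp=10, fn=5, tn=95)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```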
Figs. 6-9 give a visual representation of the metrics used in this research, namely Precision, F-Measure, Recall, and Accuracy. Table IV also shows the results of the various classifiers used in this research.

Accuracy:
With 10-fold cross-validation, the ontological model achieved the highest accuracy, 96.88%, followed by Random Forest at 96.00% and by Support Vector Machine and Logistic Regression, both at 95.30%, according to Fig. 6 and Table IV. The split test mode gave almost the same results: 96.00% and 95.10% for the ontology and Random Forest, respectively, and 94.60% for both Support Vector Machine and Artificial Neural Network.

Precision:
The ontology classifier has the highest Precision, 97.64%, in 10-fold cross-validation mode, followed by Random Forest and Naïve Bayes. In split test mode, the highest Precision, 97.00%, is achieved by ANN. More details are shown in Table IV and Fig. 7.

Vol. 13, No. 7, 2022

Recall:
According to Fig. 8 and [...]

The experimental findings reveal that the ontology model has the highest accuracy, 96.9%, followed by Random Forest at 96.00% and by both Logistic Regression and Support Vector Machine at 95.30%. Based on the results above, we see no significant difference between the 50% split and 10-fold test modes. We conclude that the ontological model can help by extending the scope of the machine learning model. Ontologies can comprise any data type or variation, and each kind of data can be assigned to a specific task. Combining the ontological model with machine learning may provide good outcomes. The ontological model achieves results that are comparable to those of the machine learning classifiers. Humans can interpret its findings, and the rules can be modified or extended as needed. Furthermore, ontologies support unstructured, semi-structured, and structured data formats, allowing for more seamless data integration. They can cover all aspects of the data modeling process, starting with schemas at the most basic level. As a result, they can handle the massive amounts of data used as input for machine learning training or produced as output. Furthermore, an ontology can match any organization's aim, which might be mathematical, logical, or semantic-based. To the best of our knowledge, this is the first comparative study of an ontological model and ML in which the ontology is integrated with ML, especially in the area of breast cancer detection. As a result, no meaningful comparison with previous work can be made.

[Figure: F-Measure for the 10-fold and 50% split test modes]

www.ijacsa.thesai.org

IV. CONCLUSION
ML methods are widely employed across scientific disciplines and have transformed industries all over the world. The use of machine learning techniques and algorithms in healthcare has recently advanced significantly. These approaches have shown success and may be valuable in the treatment of long-term diseases such as breast cancer. Furthermore, the Semantic Web has proven its usefulness and effectiveness in a multitude of areas, including health. As a Semantic Web component, an ontology can treat concepts and relationships in the same way that humans view connected concepts.
In this research, we evaluated seven machine learning algorithms and an ontology model and compared their performance. Two test modes were employed, 10-fold cross-validation and percentage split, and several performance measures, namely Accuracy, F-Measure, Precision, and Recall, were used to assess the outcomes. The findings show that the ontological model has the highest accuracy even when no feature selection is used. This opens a new research area, and we encourage researchers to contribute new insights in the same context, providing additional results and analysis to support prediction, recommendation, or decision-making. In future work, we want to improve this comparison by adopting new ways to incorporate ML rules into the ontological model, as well as by including regression machine learning algorithms.