Affinity Degree as Ranking Method

Abstract: In machine learning, ranking is a fundamental problem that attempts to order a list of items by their relevance to a given task. Ranking can be helpful, especially for future decision making. Ranking frameworks in machine learning are classified into three primary approaches: pointwise, pairwise, and listwise. However, all three approaches still lack continual learning ability, particularly when it comes to determining the degree of relevancy of ranking orders. In this paper, an affinity degree technique for ranking is proposed as another potential machine learning framework. The definition and attributes of the affinity degree technique are discussed, as well as the results of an experiment adopting the affinity degree approach as a ranking mechanism. The experiment's performance is measured using assessment metrics such as Mean Average Precision (MAP).


I. INTRODUCTION
Learning to rank is a machine learning framework that aims to organise items in a particular order according to preference and relevance. Due to its emerging use in domains like information retrieval (IR) and recommender systems, learning to rank has drawn the attention of many machine learning researchers in the recent decade. One main reason is that ranking shares the nature of classification and regression methods, so the same machine learning techniques apply. Also, machine learning methods can tune parameters to overcome disadvantages of the IR model, such as low precision and rigidness [1]. Learning to rank can be regarded as another predictive analytic technique under machine learning [2]. It can be categorised as supervised learning with training and testing phases [3], and it addresses evaluation problems in search relevancy ranking [4]. Similar to other machine learning frameworks, the performance of learning to rank models is measured using a loss function that computes the difference between prediction and ground truth [5].
Dong, Chen, Guan, Li, and Xu identified two issues in learning to rank: a lack of continual learning ability and the difficulty of constructing a large-scale, resourceful training set [1]. Falah also noted that current learning to rank approaches lack continual learning ability [4]. Therefore, this paper aims to incorporate the affinity degree classification algorithm into the ranking technique to address the learning ability issue. Since ranking methodology also uses classification and regression to rate variables, the affinity degree classification algorithm might fit the ranking system well. An affinity degree is a calculation for determining the degree of relationship between, and the classification of, correlated data. It was established in peer-to-peer network data replication [6] as a calculation that defines the similarity between two or more correlated data items. That study used an affinity degree to find the correlation between files from different nodes. The correlation is calculated to find the most binding factors contributing to the similarity between files, and the results are then ranked based on parameters. Affinity degree calculation has also been implemented as a machine learning classification technique in predictive analytics [7]. Thus, this paper explores the affinity degree technique for ranking as a machine learning framework.
The rest of the paper is organised as follows. Section 2 introduces the theoretical background of learning to rank. Section 3 describes the affinity degree. Section 4 presents the experiment that adopts the affinity degree in the learning-to-rank framework. Section 5 evaluates and discusses the experimental results to validate the proposed idea. Finally, Section 6 concludes the paper.

II. THEORETICAL BACKGROUND
In traditional IR approaches, machine learning techniques became popular for the ranking problem, where the learning-based method aims to use labelled data to learn a practical ranking function [8]. Learning to rank encompasses mainly supervised algorithms that use machine learning techniques to train a model for a ranking task. In software engineering, learning to rank has been successfully applied to defect prediction to rank modules by their defectiveness. In test prioritization, it can rank test targets based on a testing objective [9].
Learning to rank can be categorised into pointwise, pairwise, and listwise approaches. Pointwise approaches form the model from the scores that users assign to individual objects. The output ranking is a collection of documents with their individual scores. There is no dependence between training documents, since each is used independently [1], [10]. As the simplest form, pointwise ranking can be treated as classification or regression by learning the numerical rank values of documents as absolute quantities [11].
The pairwise approach learns by comparing two training objects and their given ranks or ground truth [12]. The model is trained on samples formed as object pairs with independent variables, learning a classification (or regression) model in which the two documents in each pair carry relevance scores assigned by individuals. Nonetheless, only the dependence within each pair is considered, which means that the dependence among all documents in the full ranking cannot be considered entirely [1], [11]. The applicability of such methods is also limited by the high computational cost of pairwise comparisons of user-rated items when generating the training samples for the binary classifier [10].
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 13, No. 3, 2022, www.ijacsa.thesai.org
The third family, listwise approaches, learns from lists of documents. In each list, the documents are assigned to a query with differing relevance scores. Typically, this approach optimises a smooth approximation of a loss function that measures the distance between the reference list of ranked items in the training data and the ranked list of items produced by the ranking model [10]. One advantage is that more dependence between documents is considered than in pointwise and pairwise models, though with unreliable flexibility [1]. Meanwhile, Hass points out that the pairwise and listwise approaches usually perform better than the pointwise approach [12].
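As a rough illustration of how the three families differ, the sketch below computes one simple loss of each kind for a single query. The particular losses chosen here (squared error, a pairwise hinge, and a softmax cross entropy over the whole list) and all function names are illustrative assumptions, not the methods of the cited works.

```python
import math

def pointwise_loss(scores, labels):
    """Mean squared error: every item is judged on its own."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)

def pairwise_loss(scores, labels, margin=1.0):
    """Average hinge loss over pairs (i, j) where item i is more relevant than j."""
    losses = []
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                losses.append(max(0.0, margin - (scores[i] - scores[j])))
    return sum(losses) / len(losses) if losses else 0.0

def _softmax(values):
    """Numerically stable softmax over a list of floats."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def listwise_loss(scores, labels):
    """Cross entropy between softmax(labels) and softmax(scores) over the full list."""
    p_true = _softmax(labels)
    p_pred = _softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(p_true, p_pred))

# Hypothetical relevance labels and two candidate score lists for one query.
labels = [2.0, 1.0, 0.0]
good_scores = [2.0, 1.0, 0.0]   # agrees with the labels
bad_scores = [0.0, 1.0, 2.0]    # reverses the order
```

With these inputs, each of the three losses is lower for `good_scores` than for `bad_scores`; what changes between families is whether the loss looks at single items, pairs, or the whole list.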
Finding a suitable algorithm for a specific data set is important for extracting the best information. Therefore, comparing algorithms and ranking them helps indicate which one should be applied. To select the best algorithm for a given problem, Carlos presents a combination technique called Zoomed ranking, which analyses the given data set and compares it with relevant data sets that have already been processed by the algorithms, using a "distance" concept for the calculation [13]. Also, Brazdil presented three ranking methods for algorithm selection: average ranks, success rate ratios, and significant wins [14]. These ranking methods were evaluated using average weighted correlation measures.
Ranking systems have several frameworks besides machine learning. Even so, various studies address the application of machine learning to ranking challenges and the importance and advantages of ranking within the machine learning framework. Yongyao Jiang addresses the ranking challenge in geospatial data discovery and proposes a system architecture that combines existing search-oriented open-source software, a semantic knowledge base, ranking feature extraction, and a machine learning algorithm [15]. The results show that the machine learning approach outperforms other methods in terms of both precision at K and normalised discounted cumulative gain.
Besides, the importance of learning to rank in the construction of IR systems has been pointed out in [16]. Each query has a set of associated documents represented by feature vectors that reflect the relevance of the documents to the query, so the goal is to build a model that predicts the ground truth labels of test data as accurately as possible in terms of the loss function. Learning to rank can also be used to compare multiple ranking algorithms across different approaches in terms of accuracy and efficiency. Also, Hong Li specifically discussed the fundamental problems, existing approaches, and future work in learning to rank [13]. Document retrieval is a task in which the system maintains a collection of documents. Given a query, the system retrieves the documents containing the query words from the collection, ranks them, and returns the top-ranked documents.
Although ranking systems are most common in the IR environment, recent studies show that they can be applied in other environments, such as the medical field. A ranking system was used to rank multimodal features extracted from Congestive Heart Failure (CHF) and Normal Sinus Rhythm (NSR) subjects, and the high-ranking features were then used to distinguish CHF from normal subjects. The findings indicate that the proposed approach with feature ranking can be beneficial for automatic detection of congestive heart failure patients and can support clinicians' and physicians' further decision-making to decrease the mortality rate [17]. In a case study from Iran, A. Rad used the AHP algorithm and data mining to cluster and rank university majors [18]. Also, in the data mining field, D. Sculley proposes an effective and efficient combined regression and ranking method that optimises the regression and ranking objectives simultaneously [19]. Koshti used the pairwise learning to rank approach to make faster and better decisions when recruiting football players, producing a list of options ranked on given criteria [2].
Nevertheless, regarding the notion of affinity that this paper proposes to use as a ranking technique, one study presents a novel ranking scheme, Affinity Rank, which utilises two metrics [20]: diversity, which measures the topic coverage of a group of documents, and information richness, which measures the amount of information contained in a document. The focus of that study is to evaluate the diversity of information retrieval performance. Although affinity in ranking systems is not entirely new, few such studies exist.

III. AFFINITY DEGREE METHOD
Affinity is a notion that has received widespread attention in domains such as chemistry, biology, physics, social networks, security, and computer science. Its meaning varies with the area of concern. Here, affinity is defined as a relationship, similarity, dependency, or closeness between variables. The affinity notation below follows that used in data replication by Awang [6]. That study proposed combining file popularity and file affinity as the most critical parameters in replica selection. File affinity was defined as the similarity between two or more correlated files, determined before the system replicated a file. The affinity set is a set of any data that creates an affinity between files. Thus, the affinity set between sets A and B consists of the intersection of elements between A and B plus the target, and it is not a null set. The target in set B can be denoted fid(B), where f is a file and id refers to the file id.
Definition 2: The affinity degree between A and B concerning A is defined as (2). The value expresses the degree of affinity between the data set A and the affinity set AB concerning A.
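As a minimal sketch of the idea, the code below assumes that the affinity degree concerning A is the size of the affinity set divided by the size of A. This ratio is an assumption made here for illustration and is not necessarily the exact form of (2); the target term is also left out, and the file ids are hypothetical.

```python
def affinity_set(a, b):
    """Elements shared by A and B (the target term from the text is omitted)."""
    return set(a) & set(b)

def affinity_degree(a, b):
    """ASSUMED form: |A intersect B| / |A|, i.e. the degree concerning A."""
    a = set(a)
    if not a:
        raise ValueError("A must not be empty")
    return len(affinity_set(a, b)) / len(a)

# Hypothetical file ids held by two nodes.
files_a = {"f1", "f2", "f3", "f4"}
files_b = {"f2", "f4", "f9"}

degree_wrt_a = affinity_degree(files_a, files_b)  # 2 shared out of 4 in A
degree_wrt_b = affinity_degree(files_b, files_a)  # 2 shared out of 3 in B
```

Under this assumed form the measure is asymmetric: the same shared elements give a different degree depending on which set it is taken "concerning".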

IV. EXPERIMENT
The main idea of the affinity degree implementation is to measure the dependency or correlation between a cause and a particular effect. The measurement results might identify the set with the highest affinity degree as the leading cause of that effect. Therefore, this experiment focused on determining which symptoms carry the risk of leading to a heart disease diagnosis. The affinity degree values were classified into five classes based on a specific indicator, so the experiment could analyse the probability by ranking the affinity strength, or infer the correlation between dependent and independent variables.
This experiment was conducted according to the KDD process in Fig. 1 [21]. It starts with data selection, preprocessing, and transformation, the results of which are shown in Table I, followed by implementation of the affinity degree equation. The ranking results are displayed by category in Tables II, III, IV and V, before the evaluation process.
In this experiment, MAP is used as the evaluation method. MAP is the mean of the average precisions computed for each query. Average precision is computed as the sum of the precisions at each found and relevant document, divided by the number of relevant documents. Under this construction, relevant objects that are not found receive a precision of zero [22].
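The MAP definition above can be sketched in a few lines. The function names and the toy queries are illustrative assumptions, and each ranked list is assumed to contain no duplicate items.

```python
def average_precision(ranked, relevant):
    """Sum of precision at each relevant hit, divided by the number of relevant items.

    Relevant items that never appear in `ranked` add nothing to the sum,
    which is the same as giving them a precision of zero.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / len(relevant)

def mean_average_precision(queries):
    """Mean of average precision over (ranked_list, relevant_set) pairs."""
    if not queries:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)

# Toy example: "e" is relevant but never retrieved, so it contributes zero.
query_1 = (["a", "b", "c", "d"], {"a", "c", "e"})
query_2 = (["x", "y"], {"y"})
map_value = mean_average_precision([query_1, query_2])
```

For `query_1`, the relevant hits are at positions 1 and 3, giving precisions 1/1 and 2/3; dividing their sum by the three relevant items penalises the missing item "e".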

A. Heart Disease Data
The heart disease data sets used in this research were obtained from the Heart Disease Databases in the UCI Machine Learning Repository [23]. This data set dates from 1988 and consists of four databases contributed by the Cleveland Clinic Foundation (CCF), Hungarian Institute of Cardiology (HIC), Long Beach Medical Center (LBMC), and University Hospital in Switzerland (SUH), respectively. Each heart disease database has the same clinical instance format for each patient with 76 attributes, including the target attribute. It consists of 1025 patients, of whom 499 were diagnosed with heart disease while 526 were healthy. The target field refers to the presence of heart disease in the patient. It is an integer valued 0 or 1, indicating the absence or presence of coronary heart disease. For other attributes, an integer valued from 0 to 4 states the absence, presence, and severity of heart disease. Risk factors can be divided into those that can be controlled and those that cannot. The risk factors that can be controlled are blood pressure, blood cholesterol level, smoking, diabetes, obesity, inactivity and stress.
Meanwhile, the risk factors that cannot be altered are age, gender, family history, and race. As part of preprocessing, the attributes in this paper were compared to the significant risk factors mentioned in previous studies for simplicity. Preprocessing focused on the handling of missing values, discretisation of numeric attributes, and removal of instances with missing values [24], [25]. The attributes were then compared with Hajar [26], Berg Gundersen, Sørlie and Bergvik [27], Mack and Gopal [28], and McClelland et al. [29]. For the age attribute, the age classes were formed using a class interval, computed as the highest age minus the lowest age, divided by the number of classes. Table I shows the reduced attribute details used in this experiment. Also, for a better analysis, the data were then clustered into four categories: male older, male younger, female older, and female younger.
The experiment then applied the adapted equation defined as (2) in Section 3 to the data set. The affinity degree values were then ranked from highest to lowest, as displayed in Tables II, III, IV and V. The ranking results show that the symptoms differed for each category. Thus, gender and age might strongly influence which risk factors indicate a heart disease diagnosis. For evaluation, this experiment used MAP; the results are discussed in the next section.
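The class interval and the four categories can be sketched as follows. The number of age classes and the rule for labelling a patient "older" are not stated in the text, so the sketch assumes two equal-width classes with the upper class labelled "older"; the record fields and sample values are hypothetical.

```python
def class_width(ages, n_classes):
    """Class interval: (highest age - lowest age) / number of classes."""
    return (max(ages) - min(ages)) / n_classes

def age_class(age, lowest, width, n_classes):
    """0-based class index for an age; the highest age falls in the last class."""
    if width == 0:
        return 0
    return min(int((age - lowest) // width), n_classes - 1)

def categorise(records, n_classes=2):
    """Group records by sex and age class (ASSUMED: top class means 'older')."""
    ages = [r["age"] for r in records]
    lowest, width = min(ages), class_width(ages, n_classes)
    groups = {}
    for r in records:
        top = age_class(r["age"], lowest, width, n_classes) == n_classes - 1
        key = f"{r['sex']} {'older' if top else 'younger'}"
        groups.setdefault(key, []).append(r)
    return groups

# Hypothetical patient records.
patients = [
    {"age": 30, "sex": "male"},
    {"age": 40, "sex": "female"},
    {"age": 60, "sex": "male"},
    {"age": 70, "sex": "female"},
]
groups = categorise(patients)
```

Here the class width is (70 - 30) / 2 = 20, so ages 30 and 40 land in the younger class and ages 60 and 70 in the older class.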
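The ranking step might look like the sketch below for one category. It assumes binary symptom attributes and takes the degree for each attribute as the share of patients showing the symptom who also have a positive diagnosis; this ratio, the attribute names, and the records are illustrative assumptions rather than the paper's equation or data.

```python
def rank_by_affinity(records, attributes, target="target"):
    """Return (attribute, degree) pairs sorted from highest to lowest degree."""
    positives = {i for i, r in enumerate(records) if r[target] == 1}
    ranking = []
    for name in attributes:
        with_attr = {i for i, r in enumerate(records) if r[name] == 1}
        # ASSUMED degree: |with_attr intersect positives| / |with_attr|
        degree = len(with_attr & positives) / len(with_attr) if with_attr else 0.0
        ranking.append((name, degree))
    return sorted(ranking, key=lambda pair: pair[1], reverse=True)

# Hypothetical records for a single category.
records = [
    {"high_bp": 1, "high_chol": 1, "smoker": 0, "target": 1},
    {"high_bp": 1, "high_chol": 0, "smoker": 1, "target": 1},
    {"high_bp": 0, "high_chol": 1, "smoker": 1, "target": 0},
    {"high_bp": 1, "high_chol": 1, "smoker": 0, "target": 0},
]
ranking = rank_by_affinity(records, ["high_bp", "high_chol", "smoker"])
```

Running the same function separately on each of the four categories would give one ranked list per category, which is how differing symptom orders across categories could be compared.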

V. EVALUATION AND DISCUSSION
The experimental results reveal a variance in affinity degree that shows the relations or correlation between data with various affinity degree values. As shown in Fig. 2, the mean average precision of the affinity degree ranking differs between categories, but only by a small gap. The male and older category scored 0.39, the female and older category 0.40, the male and younger category 0.27, and the female and younger category 0.29. The overall mean average precision is 0.34.
Fig. 2. The Evaluation Result of MAP for Heart Disease Rank.
All the mean average precision values in each category were less than 0.5. The number of instances in each category might influence the evaluation results. For example, in the male and older category, there are 14 presence instances and only 5 absence instances, so the gap between the two classes is small. The same goes for the male and younger category: although there were 90 instances in total, the gap between the two classes was only 6. The small gap between classes does influence the mean average precision calculation.
The affinity degree is calculated to determine the relationship between heart disease symptoms and the diagnosis. From the coronary heart disease data sets, all 1025 patient records were taken for the calculation. The data set was clustered into four groups according to the patient's gender and age. Based on the affinity degrees calculated in this experiment, the attribute with the highest degree or rank is the most likely indicator of whether a patient will be diagnosed with heart disease. The limitation of this experiment is that the results are not verified, as no domain expert was involved. In future, more experiments with varying data volumes need to be done, with a domain expert verifying the results.
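The overall value of 0.34 is consistent with an unweighted mean of the four category scores. The text does not state how the overall figure was obtained, so this is offered only as a plausibility check under that assumption:

```latex
\[
\mathrm{MAP}_{\text{overall}} \approx \frac{0.39 + 0.40 + 0.27 + 0.29}{4}
= \frac{1.35}{4} = 0.3375 \approx 0.34
\]
```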

VI. CONCLUSION
This paper implemented the notion of affinity as an alternative technique for ranking systems. Heart disease experiments with an enhanced affinity degree equation were carried out. The experiment determines the strength of correlation or dependency between data and then ranks them by affinity degree value. The experiment was evaluated with the MAP method, which computes the mean of the average precision over a set of queries. The results show the potential of affinity degree as a ranking technique. In the future, more experiments on diverse data samples with larger data volumes could be used to validate and verify the equation.

Thanks to the internal grant of UNISZA (UniSZA/2021/DPU2.0/08) for financially supporting our work. Also, thanks to all team members for reviewing for spelling errors and synchronisation consistency and for the constructive comments and suggestions.