Effective Service Discovery based on Pertinence Probabilities Learning

Web service discovery is one of the most motivating issues in the field of service-oriented computing. Several approaches have been proposed to tackle this problem. In general, they leverage similarity measures or logic-based reasoning to perform this task, but they still present some limitations in terms of effectiveness. In this paper, we propose a probabilistic approach that merges a set of matching algorithms and boosts the global performance. The key idea consists of learning a set of relevance probabilities and then using them to produce a combined ranking. Experiments conducted on the real-world dataset "OWLS-TC 2" demonstrate the effectiveness of our model in terms of mean average precision (MAP); more specifically, our solution, termed "probabilistic fusion", outperforms all the state-of-the-art matchmakers as well as the most prominent similarity measures.

Keywords—Service-oriented computing; web service discovery; rank aggregation; probabilistic fusion


I. INTRODUCTION
Web service technology is currently involved in many applications, such as business process management and recommendation systems [1]. Thanks to its modularity, composability and loose coupling, this technology is largely utilized in data integration and application composition. To meet these objectives, one has to discover and rank the services that best match her/his needs. According to [2], service discovery can be defined as follows: given a web service repository and a query requesting a service (hereafter, service query), the web service discovery problem consists of automatically finding a service from the repository that matches these requirements. Only those services that: 1) produce at least the requested output parameters that satisfy the postconditions, 2) use only part of the provided input parameters that satisfy the preconditions, and 3) produce the same side effects can be valid solutions to the query.
Several approaches have been proposed in the literature for tackling the web service discovery problem [3]. Based on the works of [4], [5], we distinguish three types of discovery approaches: logic-based reasoning methods, non-logic-based techniques (i.e. similarity measures, graph matching, data mining, etc.) and hybrid techniques, which merge the logic and non-logic solutions. Despite the progress made in this field, much remains to be done to achieve an acceptable level of performance. For instance, the logic-based approaches are often characterized by a poor recall rate (since the underlying semantics of service interfaces can be implicit and not captured by the ontologies) [4]. On the other hand, the similarity measures do not all have the same performance; in addition, the choice of the most relevant similarity is not obvious and generally depends on the actual user's request. Furthermore, many similarity measures have hyper-parameters (e.g. the fuzzy similarity proposed in [6]) that need to be adjusted for the search; an arbitrary initialization of these parameters is therefore inappropriate and may entail misleading results. Consequently, we must combine both types of matching algorithms to enhance the discovery performance. In this line of thought, the creation of a hybrid matching algorithm must address the following concerns: 1) How to solve the ordering conflicts entailed by the individual matching algorithms (for instance, one algorithm may conclude that service S_1 is better than service S_2, while another may decide that S_2 is better than S_1)? 2) How to infer the most suitable matching algorithm for each user's request, and exploit this knowledge in the fused scheme? 3) How to boost both recall and precision, while preserving a tolerable execution time?
In this paper, we handle the aforementioned difficulties, by adopting machine learning and the theory of probability as a clue for combining the individual matching algorithms.
More specifically, given the m rankings provided by the m matching algorithms (or similarity measures), our machine learning algorithm derives a global ranking by calculating a fusion score for each service S_i; this score is a weighted sum of the scores (denoted score_ij, where j is the identifier of a matching algorithm) provided by the matching algorithms. Each score_ij represents the probability that S_i is relevant to the current request; the higher the value of score_ij, the better the fusion score of S_i. With this fusion scheme, we can answer the abovementioned concerns. In particular, the ordering conflicts are resolved using the weighted sum (which can be viewed as a weighted vote). Additionally, the most suitable matching algorithms are those that have a higher weight and a higher value of score_ij (see Equation 22). These heuristics will ensure a good performance in terms of recall and precision. Moreover, if we assume that the m matching algorithms are independent and have a precision equal to p (where p >= 0.5), then according to the jury theorem [7], a majority voting method (or a weighted voting method) will achieve a precision higher than p. In summary, our proposed solution can be described as follows: first, we divide each individual ranking into a set of segments; second, for each segment, we compute its probability of relevance (i.e. the probability that a member of the segment is relevant with respect to the current request); third, we aggregate the aforementioned probabilities through a linear formula. To choose the ideal number of segments (ns) used in the second step, we perform a cross-validation that evaluates the mean average precision of the proposed model.
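The weighted-vote idea can be sketched in a few lines; the weights and relevance probabilities below are illustrative assumptions, not values learned by the paper's algorithm:

```python
def fuse(scores, weights):
    """Combine per-algorithm relevance probabilities into one fused score.

    scores[j] is the relevance probability score_ij assigned to a service
    by matching algorithm j; weights[j] is that algorithm's weight.
    """
    return sum(w * s for w, s in zip(weights, scores))

# Two matching algorithms disagree on two services; the weighted vote decides.
weights = [0.7, 0.3]                      # algorithm 1 is more trusted
s1 = fuse([0.9, 0.2], weights)            # service S1
s2 = fuse([0.4, 0.8], weights)            # service S2
ranking = sorted(["S1", "S2"], key=lambda s: {"S1": s1, "S2": s2}[s],
                 reverse=True)
```

Here the ordering conflict (algorithm 1 prefers S1, algorithm 2 prefers S2) is resolved in favour of S1, because the more trusted algorithm's vote dominates.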
The remainder of the paper is organized as follows. In Section II, we review the state of the art. We formally define the problem in Section III. Section IV presents the probabilistic fusion algorithm. The results of the experimental study and the threats to validity are presented in Section V. Finally, Section VI concludes the paper.

II. STATE OF THE ART
Web service discovery has received much attention in recent years. In general, we discern three types of web service matchmaking approaches: logic-based matchmaking, non-logic-based matchmaking and hybrid matchmaking [5], [3], [8].

A. Logic-based Matchmaking
The first category of matching leverages pure logic reasoning; more precisely, the matchmaker uses consistency tests or subsumption mechanisms to decide whether a relationship exists between the user request and the advertised service [9].
The work by [10] presents an automatic location of services (ALS) that distinguishes five matching degrees (Match, ParMatch, PossMatch, NoMatch and PossParMatch).
In [11], the authors enhance the framework proposed in [10]; in particular, they add further matching degrees, such as:
• RelationMatch: the advertised service does not meet the required outputs, but it offers outputs that have a relation with them.
• ExcessMatch: The advertised service meets all the required outputs, but it offers supplementary outputs that are not needed by the user.
A logical matching framework is presented in [9]; this latter architecture takes into account almost all functional properties, including inputs, outputs, preconditions, and effects (IOPE).
The major weakness of logic-based approaches is their high rate of false positives and false negatives [4].
In addition, the theoretical complexity of the subsumption test is PSPACE-complete or even EXPTIME-complete for certain fragments of description logics [12].

B. Non-logic-based Matchmaking
Since the aforementioned pitfalls hinder progress in logic-based matchmaking, some scholars have developed a different family of solutions. These techniques [13] mainly leverage graph matching, data mining, combinatorial optimization, and probabilistic matching.
The framework proposed by [8] matches the user's request against the OWL-S descriptions using the service name, service input, and service output parameters. These attributes are first filtered using a part-of-speech (POS) procedure to eliminate stop words, special characters, numbers, and uncategorized nouns. Then, the resulting terms are disambiguated using the WordNet dictionary. Finally, these terms are matched using a WordNet-based similarity measure.
A new redescription of services is presented in [14]. The main idea consists of using Dirichlet probability distributions [15] and clustering [16] to provide a latent factor-based specification of services.
The iMatcher1 framework presented in [17] leverages the service profile to perform a syntactic matching of services; more specifically, it uses four distance functions to match the request and the services: TF-IDF (Term Frequency-Inverse Document Frequency) [18], the Levenshtein edit distance [19], the cosine vector measure [20], and the Jensen-Shannon divergence measure [21].
In [22], the authors utilize fuzzy sets and rule based systems to tackle the web service discovery and selection problem. More specifically, the proposed work matches both capability attributes (functional aspect) and context attributes (non-functional aspect).
The work by [23] presents a collective dominance function to handle the QoS preferences of a set of users. This function is more flexible and enables control of the size of the service skylines.
In [24], the authors tackle the discovery of services while taking into account dynamic QoS properties. In particular, they leverage statistical time series to model the QoS fluctuations.
The work in [25] defines a composition framework by means of integration with fine-grained I/O service discovery that enables the generation of a graph-based composition which contains the set of services that are semantically relevant for an input-output request. The proposed framework also includes an optimal composition search algorithm to extract the best composition.
The work of [26] compares the semantic discovery approaches according to several criteria, such as the interface type (e.g. OWL-S, WSMO), the scalability, the request expansion, the adopted similarity measure and the use of natural language processing.
The work by [27] proposes a two-stage discovery approach: an offline phase and an online phase. The input of the offline phase is a set of categorized services (most existing registries ensure this categorization, e.g. ProgrammableWeb). Each service is represented as a set of service goals. A service goal is a triple constituted of a verb, a core noun and optional parameters (such as adjectives and non-core nouns). The service goals extracted from all services of each category are clustered into groups using the K-means algorithm and a WordNet-based similarity measure.
In the online phase, the nearest category (with respect to the request) is retained, and thereafter the user's request is expanded using the service goal clusters of that category. At the end, the services of the target category are matched against the expanded query.

C. Hybrid Matchmaking
The third class aggregates the former categories in order to enhance the search quality. There are several ways of merging the aforementioned types: using machine learning or heuristics to tune the weights of the matching algorithms, using social choice theory to fuse the input rankings, or leveraging probabilistic/fuzzy relationships for the same purpose.
The simplest heuristic for merging a set of individual matching algorithms is to associate a fixed rank t (or priority) with each matching function.
The OWLS-MX framework [28] matches the input/output attributes of service profiles. This system proposes seven matching degrees, combining logic-based degrees (e.g. Plug-in, Subsumes and Subsumed-By) with hybrid ones (e.g. Logic-based Fail and Nearest-neighbor).
The work by [29] introduces a matchmaker for SAWSDL-based services. The approach leverages both subsumption tests and information retrieval models for pairing the request and the advertised services.
The ISEM framework [5] is a hybrid matching approach that combines both the OWLS-MX3 filters and SVM-based learning for discovering services.
In [31], the authors develop three probabilistic functions for searching and ranking web services. Each function involves multiple matching algorithms (logic, textual similarities, etc.).
In the same work [31], the authors present a comparative evaluation that involves several voting models, such as CombSUM, CombMNZ [32], the Borda-fuse model [33], and the outranking model [34]. According to the experiments, the CombMNZ system is better than the other voting models, but it is less effective than some individual matching algorithms (such as information loss).
In [35], the authors also adapt the Condorcet fuse model [36] to the service discovery problem. More specifically, they compare the partial scores provided by the individual matching functions through a fuzzified version of the dominance relationship [6]. The preliminary results show that the proposed approach largely outperforms the individual algorithms. However, the results could be further boosted by smarter parameter tuning.
In [37], the authors introduce a new context-based solution relying on QoS (Quality of Service). It exploits both functional and non-functional user requirements and gives the user the ability to control the discovery process; in other words, the main aim of this work is to locate the web service that best corresponds to the user's context.
In [38], the authors propose a multi-criteria decision method (MCDM) for searching web services based on contextual attributes (e.g. location, language, and size of screen). Since the standard similarity measures (such as Cosine and Extended Jaccard) are not suitable for handling contextual attributes, the authors propose a set of rules and a voting method to compare and rank services.

III. PROBLEM DEFINITION

A. Introduction
In the following, we present a motivating scenario that highlights the major difficulties encountered in web service discovery. We assume that a given user is interested in a service which accepts a set of input concepts P_1^in, P_2^in, … and provides a set of output concepts P_1^out, P_2^out, … (for the sake of simplicity, we disregard for the moment the other attributes, such as preconditions or effects).
To achieve this purpose, the customer may utilize multiple matchmaking algorithms or similarity functions, denoted f_1, …, f_n. Each function is applied on the request/service parameters (in our case, the inputs/outputs). Let RQ be the request parameter set, i.e. RQ = RQ_in ∪ RQ_out, where RQ_in = {P_1^in, P_2^in, …} and RQ_out = {P_1^out, P_2^out, …}. Similarly, we define the parameter set of the advertised service S as AS = AS_in ∪ AS_out.
Each matchmaking function f_j matches the request parameters against the parameters of the advertised services by applying the following equations:

score_in = f_j(RQ_in, AS_in)    (1)
score_out = f_j(RQ_out, AS_out)    (2)

Equations 1 and 2 compute the similarity degree between the inputs (resp. outputs) of the request and the inputs (resp. outputs) of the advertised service. Table I shows two ranked lists produced by two matching functions f_1 and f_2. Each cell labelled score_in or score_out indicates a partial matching score computed through Equation 1 or 2. These matching scores belong to [0, 1]. The aforementioned (individual) lists are ranked according to the mean score.
For the sake of simplicity, we suppose that all services have a single input P_in and a single output P_out; the same assumption holds for the request. By analysing the previous table, we notice the following: first, the two rankings disagree about the ordering of services A and B; second, resolving the conflict by computing the mean score over all partial matching scores (see the third line of each service) is not always a relevant heuristic, and may be erroneous for some user's requests.
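The mean-score heuristic and the ordering conflict can be reproduced on a toy example; the scores below are illustrative, not the actual values of Table I:

```python
# Partial matching scores in [0, 1] for two services under two functions;
# each service maps a function name to its (score_in, score_out) pair.
scores = {
    "A": {"f1": (0.9, 0.8), "f2": (0.3, 0.4)},
    "B": {"f1": (0.6, 0.5), "f2": (0.7, 0.9)},
}

def mean_score(service):
    """Mean over all partial matching scores of one service."""
    parts = [p for pair in scores[service].values() for p in pair]
    return sum(parts) / len(parts)

# f1 ranks A above B while f2 ranks B above A: an ordering conflict.
rank_f1 = sorted(scores, key=lambda s: sum(scores[s]["f1"]), reverse=True)
rank_f2 = sorted(scores, key=lambda s: sum(scores[s]["f2"]), reverse=True)
# The mean-score heuristic resolves the conflict, but nothing guarantees
# that the resulting order is the relevant one for this request.
overall = sorted(scores, key=mean_score, reverse=True)
```

In this toy run, the mean score puts B first even though f1, which may be the better-suited function for this request, strongly prefers A.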
Thus, the creation of an optimal ranking (which provides the highest precision and recall) is not obvious, since we must deal with the specificities of each request as well as the service position within each (individual) list.
As discussed above, each matching function is only effective on a subset of requests, and may perform poorly on the remaining ones. Consequently, it is advantageous to combine a set of matching functions: by doing so, we leverage the advantages of the adopted matching techniques and boost the global performance.
To combine the individual matching algorithms, we have to aggregate the partial scores/ranks of the services. Several aggregation schemes have been proposed in the literature [28], [3]. These approaches may leverage voting-based models, probability theory, fuzzy set theory, and machine learning.
To determine the most effective mechanism, we have to conduct an exhaustive comparative study and derive the optimal configuration of parameters.

B. Specification of the Discovery Problem
To facilitate the presentation of the problem, we adopt the following notation: let PRL_ij be a (partially) ranked list of the i-th request under the j-th matching function. Formally:

PRL_ij = ⟨(S_1, V_1ij), …, (S_|dataset|, V_|dataset|ij)⟩

where dataset is the collection of services (i.e. S_1, …, S_|dataset|) and V_kij ∈ R^d. Each V_kij gathers the partial matching scores computed through Equations 1 and 2; it measures the similarity between the parameters of the i-th request and the parameters of the k-th service using the j-th matching function. In this case, d is set to 2, since we have one descriptor for inputs and one for outputs.
In the following, we specify the discovery problem. Given:
• a set of (partially) ranked lists for each request: {PRL_11, …, PRL_m1, …, PRL_1|Q|, …, PRL_m|Q|},
we aim to produce a combined ranking (denoted CombinedRanking_i) for each request RQ_i, such that MAP(CombinedRanking_1, …, CombinedRanking_|Q|) is maximized, where CombinedRanking_i represents the fused list of the i-th request (RQ_i).
MAP represents the mean average precision criterion. It is defined as follows:

MAP = (1 / |Q|) Σ_{i=1..|Q|} AveragePrec(CombinedRanking_i)

AveragePrec(CombinedRanking_i) = (Σ_k precision(CombinedRanking_i, k) · rel(k)) / |{services relevant to the i-th request}|

where precision(CombinedRanking_i, k) is the precision at the k-th position of the i-th combined ranking, and rel(k) = 1 if the service at position k is relevant to the i-th request, and 0 otherwise.
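The MAP criterion can be written as a short routine (a sketch with standard IR definitions; the function names are ours):

```python
def average_precision(ranking, relevant):
    """AveragePrec of one combined ranking: mean of precision@k over the
    positions k that hold a relevant service (i.e. where rel(k) = 1)."""
    hits, total = 0, 0.0
    for k, service in enumerate(ranking, start=1):
        if service in relevant:
            hits += 1
            total += hits / k          # precision at position k
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, relevants):
    """MAP over the |Q| combined rankings."""
    aps = [average_precision(r, rel) for r, rel in zip(rankings, relevants)]
    return sum(aps) / len(aps)

# One request, three retrieved services, two of which are relevant:
# AP = (1/1 + 2/3) / 2 = 5/6.
score = mean_average_precision([["S1", "S3", "S2"]], [{"S1", "S2"}])
```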

IV. WEB SERVICE DISCOVERY AND RANKING
In what follows, we present our main contributions to solve the service discovery problem; in particular, we describe the individual matching algorithms (Section IV-A) as well as the probabilistic fusion scheme (Section IV-B).

A. Individual Matching Functions
In this work, we use the most promising matching functions of the information retrieval field. More specifically, we use five matching functions, defined below. To match a request R with a service S, we introduce the following notation: let RQ be the parameter set of R, and let V_ir (resp. V_or) be the vector containing the occurrence counts of the indexed inputs (resp. outputs) of the request R. V_ir is derived from RQ_in; similarly, V_or is derived from RQ_out.
In addition, let V_is (resp. V_os) be the vector containing the occurrence counts of the indexed inputs (resp. outputs) of the service S; V_is is derived from AS_in and, similarly, V_os is derived from AS_out. Based on the aforementioned vectors, we define the probability distributions P_ir (resp. P_or) and P_is (resp. P_os) by normalizing the corresponding occurrence vectors (each component is divided by the sum of the vector's components). The first similarity measure is defined as follows:

sim1(R, S) = 1/2 (cos(V_ir, V_is) + cos(V_or, V_os))

where cos measures the ratio between the dot product of the compared vectors and the product of their lengths:

cos(X, Y) = <X, Y> / (||X|| · ||Y||)

where <X, Y> is the dot product operator and ||X|| is the Euclidean norm of X.
The second measure is defined as:

sim2(R, S) = 1/2 (EJ(V_ir, V_is) + EJ(V_or, V_os))

where EJ (Extended Jaccard) computes the ratio between the size of the shared elements and the cardinality of the union:

EJ(X, Y) = <X, Y> / (||X||^2 + ||Y||^2 − <X, Y>)

The third measure is defined as:

sim3(R, S) = 1/2 (IL(V_ir, V_is) + IL(V_or, V_os))

where IL (Information Loss based similarity) is based on the percentage of elements that are not shared among the compared objects; the lower this percentage, the better the similarity degree. It is defined for binary vectors as the proportion of elements appearing in exactly one of the two vectors.

The fourth measure is defined as:

sim4(R, S) = 1/2 (JS(P_ir, P_is) + JS(P_or, P_os))

where JS (Jensen-Shannon based similarity) estimates the difference between the two probability distributions that represent the compared vectors; the lower the difference, the better the similarity degree. It is derived from the Jensen-Shannon divergence, based on the entropy function h(x) = −x log2(x).
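Minimal implementations of the cosine, Extended Jaccard, and Jensen-Shannon building blocks (these are the standard definitions; the aggregation into a sim measure follows the ½(inputs + outputs) pattern above):

```python
from math import log2, sqrt

def cos_sim(x, y):
    """Cosine: dot product divided by the product of Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

def extended_jaccard(x, y):
    """Extended Jaccard: <X,Y> / (||X||^2 + ||Y||^2 - <X,Y>)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

def js_divergence(p, q):
    """Jensen-Shannon divergence built from h(x) = -x*log2(x); 0 means
    identical distributions, 1 is the maximum with log base 2."""
    h = lambda v: -v * log2(v) if v > 0 else 0.0
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return sum(map(h, m)) - 0.5 * (sum(map(h, p)) + sum(map(h, q)))

# sim1-style aggregation over an input pair and an output pair of vectors.
sim1 = 0.5 * (cos_sim([1, 2, 0], [1, 2, 0]) + cos_sim([0, 1], [1, 0]))
```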
The fifth measure leverages logic matching (LOG), applied to the request and service parameters, where LOG is defined as:

LOG(RQ_out, AS_out) = MIN_{P_l ∈ RQ_out} (LogMatch1(P_l, AS_out))

In addition:

LogMatch1(P_l, AS_out) = MAX_{P_k ∈ AS_out} (LogMatch2(P_l, P_k))

In general, the logical comparison LogMatch2 of two parameters P_u, P_t assigns a degree according to the logical relationship (e.g. equivalence or subsumption) between them. This is done similarly for AS_in and RQ_in.
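The MIN/MAX structure of LOG can be sketched as follows; the `DEGREES` table standing in for LogMatch2 is a hypothetical placeholder, since the paper's numeric degrees are not reproduced here:

```python
def log_match(rq_out, as_out, pair_degree):
    """LOG(RQ_out, AS_out): every requested output P_l is scored by its
    best-matching advertised output P_k (MAX), and the request is scored
    by its worst-covered requested output (MIN)."""
    return min(max(pair_degree(pl, pk) for pk in as_out) for pl in rq_out)

# Hypothetical LogMatch2: numeric degrees for logical relationships.
DEGREES = {("Book", "Book"): 1.0,   # equivalent concepts
           ("Book", "Media"): 0.5,  # Media subsumes Book
           ("Price", "Price"): 1.0}

def pair_degree(pl, pk):
    return DEGREES.get((pl, pk), 0.0)  # unrelated concepts fail (0.0)

full = log_match({"Book", "Price"}, {"Book", "Price"}, pair_degree)
partial = log_match({"Book"}, {"Media"}, pair_degree)
```

The MIN makes the score pessimistic: a single uncovered requested output pulls the whole degree down, which matches the "produce at least the requested outputs" requirement of the discovery problem.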
In the following, we present our probabilistic fusion scheme, which consists of three algorithms. The first one, hereafter referred to as RPC (Relevance Probability Computation), computes the knowledge that allows the fusion of the input lists. The second one, termed PF (Probabilistic Fusion), produces the TopK elements of the combined (fused) ranking. The third one, termed CVBT (Cross-Validation-Based Tuning), leverages cross-validation to select the optimal number of segments.

B. Proposed Algorithms
To build the combined ranking, we adapt the probabilistic approach proposed in [39] to the context of web services. In a nutshell, the basic idea consists of learning a set of probabilities that are involved in the computation of the fused score of each service; the higher the fused score, the better the rank. The algorithm performing this task is referred to as RPC. Each learned probability (denoted MRelP_ri(S_l)) represents the likelihood that a service S_l returned in segment r is relevant, given that it has been returned by matching function i. The pseudo code of RPC is explained as follows:
• (Lines 1-7), for each matching function i and request Q_j, we compute the corresponding ranking, termed ranking_i.
• (Lines 8-9), we sort the aforementioned ranking and we get the relevant services of the request Q j .
• (Lines 10-13), for each segment r, we extract its members, thereafter we compute its relevance probability by applying the formula of the precision criterion. This rule calculates the likelihood that a segment r derived from the function i is relevant to the request Q j .
• (Lines 15-19), for each segment r and each matching function i, we compute their averaged relevance probability (also denoted MRelP_ri). More specifically, we take the mean of the relevance probabilities related to the requests of the learning set (SRQ).
• (Line 22), we return the learned probabilities.
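The RPC steps above can be sketched as follows; this is a minimal version in which the equal-width segmentation and the nested-list data layout are simplifying assumptions:

```python
def learn_relevance_probabilities(rankings, relevant, ns):
    """RPC sketch: rankings[i][j] is the ranked service list produced by
    matching function i for training request j; relevant[j] is the set of
    services relevant to request j; ns is the number of segments.

    Returns MRelP[i][r]: the precision of segment r under function i,
    averaged over the training requests (the learning set SRQ).
    """
    m, q = len(rankings), len(rankings[0])
    mrelp = [[0.0] * ns for _ in range(m)]
    for i in range(m):
        for j in range(q):
            ranking, rel = rankings[i][j], relevant[j]
            seg_len = max(1, len(ranking) // ns)
            for r in range(ns):
                start = r * seg_len
                end = (r + 1) * seg_len if r < ns - 1 else len(ranking)
                segment = ranking[start:end]
                if segment:  # precision of segment r for request j
                    mrelp[i][r] += sum(s in rel for s in segment) / len(segment)
        for r in range(ns):
            mrelp[i][r] /= q  # average over the training requests
    return mrelp

# Toy example: one matching function, one training request, two segments.
mrelp = learn_relevance_probabilities([[["S1", "S2", "S3", "S4"]]],
                                      [{"S1", "S3"}], ns=2)
```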
The second algorithm, referred to as PF (Probabilistic Fusion), computes a fused score for each service S_l. To this end, PF leverages the learned relevance probabilities of the five individual rankings. PF is based on two heuristics (H1 and H2), which are summarized as follows:
• The better the rank (i.e. the lower the segment identifier) of a service S_l within the individual rankings, the better the fused score (H1).
• The higher the relevance probability MRelP_ri(S_l), the better the fused score (H2). This rule is explained as follows: if MRelP_ri(S_l) is large, then service S_l is more likely to be relevant and thus should be ranked higher in the combined (fused) list.
The fused score is summed up as follows:

FusedScore(S_l) = Σ_{i=1..m} MRelP_ri(S_l) / r    (22)

where r is the segment identifier of S_l under matching function i, i is the identifier of the matching function, and m is the number of matching functions.
The pseudo code of PF is given in Algorithm 2 and is explained as follows:
• (Lines 1-4), we initialize the fused scores to 0; the fused (combined) ranking is also initialized to an empty list.
• (Lines 5-10), for each matching function i and the current request Q_j, we compute the corresponding ranking, termed ranking_i.
• (Lines 12-14), for each service S_l, we get the identifier of the segment in which it lies (Sid).
• (Lines 17-20), we create and sort the combined list, according to the decreasing order of the fused score.
• (Line 21), we return the TopK elements of the combined list.
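A sketch of the PF fusion step, assuming a ProbFuse-style score in which each function contributes MRelP_ri(S_l) / r (the segment bookkeeping is simplified, and the lists and probabilities are illustrative):

```python
def probabilistic_fusion(rankings, mrelp, ns, top_k):
    """PF sketch: fuse the m individual rankings of one request using the
    learned probabilities mrelp[i][r]. Segments are numbered 1..ns; the
    division by r implements H1 (earlier segments weigh more) and the
    factor mrelp implements H2 (trusted segments weigh more)."""
    fused = {}
    for i, ranking in enumerate(rankings):
        seg_len = max(1, len(ranking) // ns)
        for pos, service in enumerate(ranking):
            r = min(pos // seg_len, ns - 1) + 1  # 1-based segment id
            fused[service] = fused.get(service, 0.0) + mrelp[i][r - 1] / r
    ranked = sorted(fused, key=fused.get, reverse=True)
    return ranked[:top_k]

# Two functions disagree on A vs B; the learned probabilities decide.
top = probabilistic_fusion([["A", "B"], ["B", "A"]],
                           [[0.8, 0.2], [0.6, 0.4]], ns=2, top_k=2)
```

With these illustrative probabilities, A accumulates 0.8/1 + 0.4/2 = 1.0 while B accumulates 0.2/2 + 0.6/1 = 0.7, so A wins the fused ranking despite the tie in raw positions.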
In what follows, we present the third algorithm, referred to as CVBT (cross-validation based tuning). This algorithm aims to select the optimal number of segments (denoted ns), i.e. the one that ensures the best mean average precision of the combined rankings. We note that ns ∈ {2, 3, …, round(|dataset|/2)}.
To fulfill this goal, we use the cross-validation principle. We first initialize ns to a given value, then divide the request collection into np parts. Thereafter, we perform the cross-validation as follows:
• We compute the relevance probabilities (i.e. the RPC function) by choosing (np − 1) parts as the set of learning requests.
• We perform the probabilistic fusion (PF) over the entire set of requests.
• We calculate the mean average precision (MAP) that corresponds to the current learning requests.
• We change the set of learning requests by considering another union of (np − 1) parts, and we redo the previous steps.
• We take the average of the calculated MAP values and consider it as the final MAP associated with the current ns (we denote this result MAP′).
We iterate the previous process (the five steps) for all possible values of ns and then choose the value that ensures the best MAP′.
In the following, we describe the pseudo code of CVBT (see Algorithm 3):
• (Line 1), we divide the entire request collection into a set of parts; for instance, if np = 5, then we have five subsets of requests.
• (Line 2), we initialize the optimal number of segments, as well as the optimal MAP.
• (Lines 7-9), for each iteration of the cross-validation, we initialize the learning requests (SRQ). The latter consists of (np − 1) parts of the entire collection. Thereafter, we use this set (SRQ) to learn the relevance probabilities (MRelP).
• (Lines 10-13), for each request Q_j, we produce the fused ranking CombinedList_j; afterwards, we calculate the corresponding average precision.
• (Line 14), we estimate the mean average precision for the current SRQ set.
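The CVBT loop can be sketched as follows; `evaluate` is a hypothetical stand-in for the RPC + PF + MAP pipeline, not a real API of the system:

```python
def cvbt(requests, np_parts, candidate_ns, evaluate):
    """CVBT sketch: choose the number of segments ns that maximizes the
    cross-validated MAP. evaluate(train, ns) is assumed to run RPC on the
    training requests, fuse all rankings with PF, and return the MAP."""
    folds = [requests[k::np_parts] for k in range(np_parts)]
    best_ns, best_map = None, -1.0
    for ns in candidate_ns:
        maps = []
        for k in range(np_parts):  # leave one fold out of the learning set
            train = [q for f, fold in enumerate(folds) if f != k for q in fold]
            maps.append(evaluate(train, ns))
        map_prime = sum(maps) / len(maps)  # MAP' for this ns
        if map_prime > best_map:
            best_ns, best_map = ns, map_prime
    return best_ns, best_map

# Dummy evaluator that simply prefers larger ns, just to exercise the loop.
best_ns, best_map = cvbt(list(range(10)), np_parts=5,
                         candidate_ns=[2, 3], evaluate=lambda t, ns: ns / 10)
```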
In the following, we show a scenario that illustrates the processing performed by the probabilistic fusion (i.e. RPC and PF), together with a comparison against the Borda [33] fusion scheme and other state-of-the-art methods.

V. EXPERIMENTAL STUDY

A. Evaluation Scheme
To assess the effectiveness and the efficiency of the proposed fusion scheme, we use the test collection OWLS-TC v2.2. The latter contains real-world web service descriptions, mainly extracted from public IBM UDDI registries. As depicted in Table II, the benchmark contains: 1) 1007 service descriptions; 2) 29 sample requests; 3) a manually identified relevance set for each request.
This information allows the computation of recall and precision.
Since we set np to 5 (np is the number of parts), 80% of the request set is utilized for learning the relevance probabilities. In addition, all requests are used for evaluating MAP and some other metrics (Recall@N, Prec@N, R-Prec), defined below:
• R-Precision (R-Prec or R-P): measures the precision at position R, where R is the total number of items relevant to the query [40].
• Precision at N (Prec@N): measures the precision after N items have been retrieved [40].
• Recall at N (recall@N): measures the recall after N items have been retrieved [40].
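These metrics can be sketched in a few lines (standard IR definitions; the function names are ours):

```python
def precision_at_n(ranking, relevant, n):
    """Prec@N: fraction of the first N retrieved items that are relevant."""
    return sum(s in relevant for s in ranking[:n]) / n

def recall_at_n(ranking, relevant, n):
    """Recall@N: fraction of the relevant items found in the first N."""
    return sum(s in relevant for s in ranking[:n]) / len(relevant)

def r_precision(ranking, relevant):
    """R-Prec: precision at position R, with R = number of relevant items."""
    return precision_at_n(ranking, relevant, len(relevant))

# Toy run: two of the four retrieved services are relevant.
ranking, relevant = ["S1", "S2", "S3", "S4"], {"S1", "S3"}
```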
We also measure the average execution time of the probabilistic fusion, the Borda fuse model and the individual matching functions. Our algorithms have been implemented in Java, and the experiments were conducted on a Core i3 1.80 GHz machine with 4 GB of RAM, running Windows 7.
In Table III, we show the average execution time of the learning phase (RPC function), the fusion phase (PF function), and the total time. Since the aforementioned algorithms have a polynomial complexity, they remain scalable for large service datasets.
In Table IV, we compare the performance of the probabilistic fusion with respect to the other approaches. We observe that all running times fluctuate between 21,000 and 29,000 ms, except for Borda, which runs in around 700 ms. This is because Borda is a simple sum of the service positions. We also notice that the logic-based approach is more efficient than the other individual methods, because we implemented the subsumption test with a logical OR. This implementation is enabled by a binary encoding scheme inspired by [41]: by coding the ontology with binary words, we significantly decrease the cost of the subsumption test.
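The binary-encoding idea can be sketched as follows; the toy ontology and bit layout are illustrative assumptions, not the actual scheme of [41]:

```python
# Toy ontology: Thing > Media > Book, and Thing > Price. Each concept's
# mask sets one bit for itself and one for each of its ancestors.
BIT = {"Thing": 1, "Media": 2, "Book": 4, "Price": 8}
MASK = {
    "Thing": BIT["Thing"],
    "Media": BIT["Thing"] | BIT["Media"],
    "Book":  BIT["Thing"] | BIT["Media"] | BIT["Book"],
    "Price": BIT["Thing"] | BIT["Price"],
}

def subsumes(general, specific):
    """general subsumes specific iff every bit of general's mask is set in
    specific's mask: a single AND plus a comparison replaces a call to a
    description-logic reasoner."""
    return MASK[specific] & MASK[general] == MASK[general]
```

Precomputing the masks turns each subsumption test into a constant-time bitwise operation, which explains why the logic-based matcher becomes the fastest individual method.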
Tables V and VI show the behavior of PF for both recall and precision. In general, we observe that the performance rises as the number of segments ns increases (for all values of K). We also notice that the best performance is obtained with ns = 500.
According to Tables VII and VIII, we observe that PF is more effective than the remaining approaches. The PF results are achieved by setting ns to 500. As demonstrated in the experiments, PF largely outperforms the Borda fuse model. This is because Borda is very sensitive to services with bad individual ranks; consequently, its global performance is unsatisfactory. On the other hand, we notice that the four similarity measures {Cos, EJ, IL, JS} have almost the same performance. The worst case is achieved by the logic-based approach.
The execution of CVBT is shown in Fig. 1, which depicts the relationship between MAP and the ns parameter. In general, we distinguish two behaviours: first, when ns ∈ {2, …, 120}, we observe a rapid improvement of the estimated MAP; second, when ns ∈ {121, …, 500}, we observe a slow improvement of MAP. The optimal value is reached around ns = 500.
From these results, we conclude that the smaller the segment size, the better the performance.
As depicted in Table IX, the R-Prec of PF is higher than that of the individual ranking algorithms as well as the Borda fuse model. In summary, PF produces a gain of 21% with respect to the highest individual R-Prec (i.e. the information loss R-Prec) and 28% with respect to the Borda R-Prec. Table X shows a comparison between the probabilistic fusion and the different systems that participated in the S3 contest 2009. We note that this competition is based on the same benchmark (i.e. OWLS-TC 2). Table X clearly shows that our approach outperforms all existing matchmakers.

VI. CONCLUSION
In this paper, we have tackled the problem of retrieving and ranking web services. Our proposed framework takes into account multiple functional descriptors (input and output parameters) as well as several matching functions (logic reasoning and text similarities).
Simply speaking, our fusion algorithm leverages a set of relevance probabilities in order to infer an optimal fused ranking. These probabilities are largely dependent on the number of segments (ns). The setting of this regulating parameter is ensured by the cross-validation process.
The obtained results are very promising, and confirm the effectiveness of the proposed scheme.
In the near future, we aim to compare our approach with alternative fusion schemes, such as probabilistic dominance and majority-based voting. These approaches can be further enhanced by tuning their critical parameters with machine learning algorithms.