A P System for K-Medoids-Based Clustering

The membrane computing model, also known as the P system, is a parallel and distributed computing model. The k-medoids algorithm is one of the best-known partition-based clustering algorithms and has been widely used in data analysis and modern scientific research. When the P system is combined with the k-medoids algorithm, the maximal parallelism of the P system can effectively reduce the time complexity of k-medoids clustering. On this basis, this paper proposes a cell-like P system with promoters and inhibitors for k-medoids-based clustering, and an instance is given to illustrate the practicability and effectiveness of the designed P system.
Keywords: P systems; Clustering; K-medoids-based clustering; Membrane computing; Parallel and distributed computing


I. INTRODUCTION
Membrane computing [1,2], initiated by Păun in 1998, is a branch of molecular computing. The computing models in the framework of membrane computing, also called P systems, are distributed, non-deterministic and maximally parallel [1]. P systems are inspired by the compartmental structure of living cells, tissues, organs, etc., and by the way they process chemical compounds. Up to now, many variants of P systems have been investigated, mainly including cell-like P systems [1,3,4], tissue-like P systems [5][6][7] and neural-like P systems [8,9]. P systems have been studied in many areas, such as biology [10], linguistics [11], computer science, mathematics [12], etc. [13]. Many variants are computationally universal; that is, they have been proved to have computing power equivalent to that of Turing machines [14,15]. More information about P systems can be found at the website in Ref. [16].
Information plays an increasingly important role in modern society, which makes data analysis an issue of crucial importance. Clustering is a basic and significant component of data analysis, and it is employed as a common method in modern scientific research [17]. However, there is no agreement on a complete definition of clustering. The classic description is: instances in the same cluster must be as similar as possible, instances in different clusters must be as different as possible, and the measurement of similarity and dissimilarity must be clear and have practical meaning [18]. Generally speaking, clustering algorithms are divided into traditional ones and modern ones, where traditional ones are based on partition [19,20], fuzzy theory [21], distribution [22,23], density [24,25], grid [26][27][28], graph theory [29,30], etc. Clustering is applied across communication science, computer science, biology, etc.
Clustering has also been introduced to membrane computing [31][32][33]. For the data clustering problem, Ref. [33] presents a novel clustering algorithm, called the membrane clustering algorithm, based on a tissue-like P system with a loop structure of cells that realizes a local neighborhood topology, and it demonstrates the high efficiency and competitiveness of the proposed algorithm. To deal with the self-driven clustering problem, Ref. [32] proposes a membrane clustering algorithm based on a tissue-like P system with a fully connected structure, which determines at the same time how many clusters are most appropriate and what a good clustering partition looks like. It develops an improved velocity-position model as evolution rules and also demonstrates the competitiveness of the proposed algorithm. Ref. [31] proposes k-medoids-based consensus clustering based on a cell-like P system with inhibitors and promoters, obtained by introducing the k-medoids algorithm and the cell-like P system with inhibitors and promoters to consensus clustering, and it is shown to be highly accurate and highly efficient.
K-medoids [34] is an improvement of k-means, and these two are the best-known partition-based clustering algorithms. K-medoids deals with discrete data and designates the data point nearest to the cluster center as the medoid. This method is more robust to noise and outliers than k-means because it minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances, but it is suitable only for small data sets because of its larger amount of computation. However, the time complexity can be decreased by cell-like P systems with promoters and inhibitors, because they have an inherent mechanism of parallel and distributed computing. In short, the maximal parallelism of P systems contributes to improving the efficiency of unsupervised learning algorithms.
In this paper, a P system $\Pi_{kmbc}$ is proposed to implement a k-medoids-based algorithm, slightly modified to fit the evolution mechanism of P systems, to achieve clustering. The paper is organized as follows: Section II introduces the basic knowledge of the k-medoids algorithm and the cell-like P system with priority and promoters; Section III presents the design of the P system $\Pi_{kmbc}$, with its k-medoids-based algorithm, definition and rules discussed in detail; an instance is given in Section IV; and conclusions are drawn in Section V.

A. The k-medoids algorithm
The k-medoids algorithm, proposed in 1987 [34], is a classical partitioning algorithm for clustering related to the k-means algorithm. Both the k-medoids and the k-means algorithms partition a data set of n objects into k clusters (k known a priori) and minimize the distance between the points in a cluster and the point designated as the center of that cluster; in k-medoids this center is a data point called the medoid. A medoid is the most centrally located point in its cluster.
The k-medoids algorithm is an improvement on the k-means algorithm that partitions around medoids instead of means. It is more robust to outliers and noise than the k-means algorithm because it chooses medoids as centers and minimizes a sum of pairwise dissimilarities. What the two algorithms share is that the actual definition of the distance has various alternatives, depending on the requirements of the actual problem. The smaller the sum of distances between pairs of data points in the same cluster is, that is, the more similar the data points in the same cluster and the more dissimilar the data points from different clusters are, the better the clustering result is.
The most representative realization of the k-medoids algorithm is the Partitioning Around Medoids (PAM) algorithm. PAM uses a greedy search, which may not find the optimum solution but is faster than exhaustive search. It works as follows [35]:
1) Initialize: select k of the n data points as the medoids.
2) Associate each data point to the closest medoid.
3) While the cost of the configuration decreases, repeat step 4).
4) For each medoid m, for each non-medoid data point o, repeat steps 5) to 6).
5) Swap m and o, recalculate the sum of distances of data points to their medoid.
6) If the total cost of the configuration increased in the previous step, undo the swap.
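The PAM steps listed above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function names `total_cost` and `pam`, the use of a precomputed distance matrix `dist`, and the choice of the first k points as initial medoids are assumptions made for the example. The sketch also undoes a swap whenever the cost does not strictly decrease, which guarantees termination.

```python
from itertools import product

def total_cost(dist, medoids):
    """Sum of distances from every point to its closest medoid."""
    return sum(min(row[m] for m in medoids) for row in dist)

def pam(dist, k):
    """Greedy PAM on a precomputed n x n distance matrix (illustrative sketch)."""
    n = len(dist)
    medoids = list(range(k))               # step 1 (assumption: first k points)
    cost = total_cost(dist, medoids)       # step 2 is implicit in the cost
    improved = True
    while improved:                        # step 3: loop while the cost decreases
        improved = False
        for slot, o in product(range(k), range(n)):   # step 4
            if o in medoids:
                continue
            old = medoids[slot]
            medoids[slot] = o              # step 5: swap medoid and non-medoid
            new_cost = total_cost(dist, medoids)
            if new_cost < cost:
                cost, improved = new_cost, True
            else:
                medoids[slot] = old        # step 6: undo a swap that does not help
    labels = [min(medoids, key=lambda m: row[m]) for row in dist]
    return sorted(medoids), labels, cost
```

Each point is labeled with the index of its medoid, so two points share a cluster exactly when their labels are equal.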
Algorithms other than PAM have also been suggested in the literature [36,37]. Among them is the Voronoi iteration method, which is simpler and faster than PAM. Its steps are as follows:
1) Initialize: select k of the n data points as the medoids.
2) Associate each data point to the closest medoid.
3) In each cluster, make the data point that minimizes the sum of distances within the cluster the medoid.
4) Reassign each data point to the cluster defined by the closest medoid determined in the previous step.
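A minimal Python sketch of the Voronoi iteration method is given below. The names `assign` and `voronoi_kmedoids`, the `max_iter` bound, the stopping test (medoids unchanged) and the choice of the first k points as initial medoids are assumptions of this example, not details taken from the cited literature.

```python
def assign(dist, medoids):
    """Label every point with the index of its closest medoid."""
    return [min(medoids, key=lambda m: dist[i][m]) for i in range(len(dist))]

def voronoi_kmedoids(dist, k, max_iter=100):
    """Voronoi iteration on a precomputed distance matrix (illustrative sketch)."""
    medoids = list(range(k))                 # assumption: first k points
    labels = assign(dist, medoids)           # initial association
    for _ in range(max_iter):
        new_medoids = []
        for m in medoids:
            # Members of the cluster of m; fall back to [m] if it is empty.
            members = [i for i, lab in enumerate(labels) if lab == m] or [m]
            # New medoid: the member with the smallest within-cluster distance sum.
            new_medoids.append(
                min(members, key=lambda c: sum(dist[c][j] for j in members)))
        if new_medoids == medoids:           # no change: stop iterating
            break
        medoids = new_medoids
        labels = assign(dist, medoids)       # reassign to the closest new medoid
    return medoids, labels
```

Compared with PAM, each round costs one medoid update per cluster and one reassignment, instead of one cost evaluation per candidate swap.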

B. Cell-like P System with Priority and Promoters
Many variants of P systems were introduced in Section I. This paper is concerned only with cell-like P systems with priority and promoters, so this subsection gives their basic concepts.
A cell-like P system has three main components: the membrane structure, objects and rules. As suggested by Fig. 1, the membrane structure is a hierarchically arranged set of membranes, which are usually identified by labels from a given set and which divide a cell-like P system into separate regions. A membrane that does not contain any other membrane is called elementary. The membrane that contains all the other membranes is referred to as the skin. Each membrane determines exactly one region, bordered above by the membrane itself and below by the membranes placed directly inside it, if any exist. Rules are effective only in the region of the membrane they belong to. Objects, which are expressed by characters or strings of symbols, can evolve into new objects or be transferred to new regions according to the rules in the membrane whose region they appear in.
Fig. 1. The membrane structure of cell-like P system [38]
Formally, a cell-like P system (of degree $m \ge 1$) with priority and promoters is of the form
$$\Pi = (O, \mu, \omega_1, \ldots, \omega_m, (R_1, \rho_1), \ldots, (R_m, \rho_m), i_{out}),$$
where
1) $O$ is the alphabet of objects used in $\Pi$;
2) $\mu$ is the membrane structure of $\Pi$, whose degree is $m$;
3) $\omega_i$, $R_i$ and $\rho_i$ ($1 \le i \le m$), where $\omega_i$ is the multiset of initial objects in membrane $i$ (the empty string $\lambda$ means that there is no object in membrane $i$), $R_i$ is the set of rules in membrane $i$, in which promoters may appear, and $\rho_i$ specifies a priority relation among the rules in $R_i$, a smaller value meaning a higher execution priority;
4) $i_{out}$ is the output region, which stores the results.
The rules in $\Pi$ take one of the forms $(u \to v)$, $(u \to v|_{\alpha}, \rho)$ or $(u \to (w)_i)$, where $u \in O^+$, $v \in (O \times Tar)^*$ and $\alpha$ is a promoter. Here $O^*$ is the set of finite multisets over $O$, $O^+ = O^* - \{\lambda\}$ is the set of non-empty ones, and $Tar = \{here, out, in_i \mid 1 \le i \le m\}$. Objects appearing in $u$ are consumed. If $v$ contains $(a, here)$, $a$ remains in the membrane where the rule is applied. If $v$ contains $(a, out)$, $a$ becomes an object of the region immediately outside the membrane where the rule is applied. If $v$ contains $(a, in_i)$, $a$ is produced in membrane $i$. An object $\delta$ appearing in $v$ dissolves the membrane in which $\delta$ is present and releases all objects of this membrane into the region immediately outside it. The second form means that $u \to v$ is executed with priority $\rho$ when $\alpha$ is present in the membrane where the rule is applied; the priority and the promoter may both appear in this form, or only one of them. The third form creates a membrane labeled $i$ and adds $w$ to the multiset in membrane $i$.
Rules are executed in each membrane according to the principles of non-determinism and maximal parallelism. When more rules are applicable than the available objects allow, the rules to be applied are chosen non-deterministically, and all rules that can be applied must be applied concurrently. Both principles are limited by the objects (reactants) available in the membrane. Since this paper deals only with cell-like P systems, the rest of the paper refers to the cell-like P system simply as the P system.
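To make the execution principles concrete, the following Python sketch performs one maximally parallel step inside a single membrane. It is a deliberate simplification under stated assumptions: rules are written as hypothetical tuples `(lhs, rhs, promoter, priority)`, target indications (here, out, in) and membrane creation or dissolution are omitted, and the non-deterministic choice among competing rules is replaced by a deterministic greedy order (higher priority first, then listing order).

```python
from collections import Counter

def parallel_step(contents, rules):
    """One maximally parallel step in one membrane (simplified sketch).

    contents: mapping object -> multiplicity
    rules:    iterable of (lhs, rhs, promoter, priority); a smaller priority
              value is served first; promoter may be None.
    """
    start = Counter(contents)       # promoters are checked against the start state
    remaining = Counter(contents)   # objects not yet consumed in this step
    produced = Counter()            # products only become available after the step
    for lhs, rhs, promoter, _priority in sorted(rules, key=lambda r: r[3]):
        if promoter is not None and start[promoter] == 0:
            continue                # promoter absent: rule not applicable
        # Apply the rule as many times as the remaining objects allow.
        times = min(remaining[obj] // need for obj, need in lhs.items())
        for obj, need in lhs.items():
            remaining[obj] -= need * times
        for obj, made in rhs.items():
            produced[obj] += made * times
    return remaining + produced
```

After the loop no listed rule with a present promoter can consume any further objects, which is what maximality requires; a faithful simulator would additionally explore the non-deterministic alternatives.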

III. THE DESIGN OF P SYSTEM $\Pi_{kmbc}$
This paper aims to obtain a P system $\Pi_{kmbc}$ for clustering based on the k-medoids method. The algorithm for $\Pi_{kmbc}$ is discussed before $\Pi_{kmbc}$ itself is defined.
The algorithm for $\Pi_{kmbc}$ is modified from PAM and the Voronoi iteration method, which are described in subsection A of Section II.
Suppose $T = \{p_1, p_2, \ldots, p_n\}$ denotes a data set of n data points, which can be multi-dimensional vectors. This paper assumes that they are two-dimensional, $p_i = (x_i, y_i)$, and names the input data set $T_{kmbc}$, in which the multiplicities of objects $a_i$ and $b_i$ represent $x_i$ and $y_i$ respectively. All data points are divided into k ($k \le n$) clusters. Each data point belongs to one and only one cluster. The default medoids are the first k data points of $T_{kmbc}$.
According to subsection A of Section II, the distances between non-medoid data points and medoids need to be calculated repeatedly. To reduce the amount of calculation and to make the algorithm realizable in P systems, the algorithm for $\Pi_{kmbc}$ introduces the distance matrix, the point-medoids distances set, the point-point distances set and the sum of the point-point distances set. In addition, this paper uses the squared Euclidean distance.
Let $D_{nn}$ denote the distance matrix,
$$D_{nn} = (d_{i,j})_{n \times n},$$
where $d_{i,j}$ is the distance between $p_i$ and $p_j$.
Let $D_i$ denote the point-medoids distances set, which contains the k distances between $p_i$ and the k medoids, that is, the items $d_{i,j}$ or $d_{j,i}$ of $D$ with $j \in S_m$, where $S_m$ is the set of the k subscripts of the k medoids and $D$ is the set of all nonzero entries of $D_{nn}$. Let $D'_i$ denote the point-point distances set, which contains the distances between $p_i$ and all the other data points belonging to the same cluster as $p_i$, that is, the items $d_{i,j}$ or $d_{j,i}$ of $D$ with $j \in S_{cm}$, where $S_{cm}$ is the set of the subscripts of all the data points in the cluster of $p_i$. The sum of the items of $D'_i$ is denoted by $d_i$. For convenience, the algorithm uses a unique data point mark $\varepsilon_i$ in place of $p_i$. Thanks to the definitions above, unique data point marks, rather than data points, are assigned to clusters.
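The distance matrix, the point-medoids distances set, the point-point distances set and its sum can be illustrated on a tiny example. The sketch below is only an illustration: the sample points, the function names and the 0-based indices are assumptions of this example, and the squared Euclidean distance is used as stated in the text.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

# Hypothetical sample data: four two-dimensional points.
points = [(1, 1), (2, 1), (6, 5), (7, 5)]
n = len(points)

# Distance matrix, computed once and reused afterwards.
D = [[sq_dist(points[i], points[j]) for j in range(n)] for i in range(n)]

def point_medoid_distances(i, medoid_ids):
    """Distances between point i and every medoid (point-medoids distances set)."""
    return {j: D[i][j] for j in medoid_ids}

def point_point_distances(i, cluster_ids):
    """Distances between point i and the other members of its cluster."""
    return {j: D[i][j] for j in cluster_ids if j != i}

def sum_point_point(i, cluster_ids):
    """Sum of the point-point distances set of point i."""
    return sum(point_point_distances(i, cluster_ids).values())
```

Because every later step only looks up entries of `D`, no distance has to be recomputed during the assignments.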
Let z denote the number of iterations, which is set manually.
Based on PAM, the Voronoi iteration method and the definitions introduced above, the algorithm for $\Pi_{kmbc}$ is mainly composed of initialization, initial assignment and iterative assignments. The algorithm flow for $\Pi_{kmbc}$ is shown in Fig. 2, and its detailed flow is described as follows.
The algorithm for $\Pi_{kmbc}$ first computes $D_{nn}$ and selects the first k of the n data points as the medoids. Next, it computes the point-medoids distances set $D_i$ for each $p_i$ in $T_{kmbc}$ and associates each data point mark with the closest medoid according to the minimal item in $D_i$: for a non-medoid data point $p_i$, the algorithm selects the minimal item $d_{i,j}$ or $d_{j,i}$ in $D_i$ and then assigns $\varepsilon_i$ to the cluster whose medoid is $p_j$. Then the iterative steps begin. The first iterative step computes the point-point distances set $D'_i$ and its sum $d_i$ for each data point and designates the new medoid according to the minimal $d_i$ in each cluster; that is, the $p_i$ that corresponds to the minimal $d_i$ is elected as the new medoid of its cluster. The second iterative step reassigns each data point mark to the cluster defined by the closest medoid determined in the previous step. The assignments in the iterative steps are implemented in the same way as the initial assignment. Owing to the mechanism of parallel and distributed computing, all data points are processed in parallel in this algorithm.
Fig. 2. The algorithm flow for $\Pi_{kmbc}$
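A sequential Python sketch of this flow is given below. It reflects one reading of the description (z rounds, each consisting of a medoid update followed by a reassignment) and does not model the membranes or the parallelism of the P system. The function name `kmbc_clustering`, the 0-based indices and the assumption that all data points are distinct are choices made for this example.

```python
def kmbc_clustering(points, k, z):
    """Sequential sketch of the described flow (assumes distinct 2-D points)."""
    n = len(points)
    # Initialization: squared Euclidean distance matrix, first k points as medoids.
    D = [[(p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 for q in points] for p in points]
    medoids = list(range(k))

    def assign(meds):
        # Put every data point mark (its index) into the cluster of its closest medoid.
        clusters = {m: [] for m in meds}
        for i in range(n):
            clusters[min(meds, key=lambda m: D[i][m])].append(i)
        return clusters

    clusters = assign(medoids)                  # initial assignment
    for _ in range(z):                          # iterative assignments
        # New medoid of a cluster: the member with the smallest distance sum.
        medoids = [min(members, key=lambda c: sum(D[c][j] for j in members))
                   for members in clusters.values()]
        clusters = assign(medoids)              # reassignment
    return clusters
```

The result maps each final medoid index to the list of point indices in its cluster.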

B. The definition of $\Pi_{kmbc}$
Based on the algorithm for $\Pi_{kmbc}$ discussed in subsection A of Section III, the definition of the P system $\Pi_{kmbc}$ is worked out as follows.
1) The alphabet of objects contains, among others, $e, z, \alpha, \alpha', \alpha'', \beta, \gamma, \delta, \eta, \theta, \lambda, \mu, \xi, \pi, \rho, \sigma, \omega$. It includes objects that are related to data points, objects that control the application of the rules, and objects that are special in P systems (such as $\delta$, $\lambda$, s and $\omega$). The most important objects are $d_i$, $d_{i,j}$, $\varepsilon_i$, $\varepsilon_{i,j}$ and $t_{i,j}$. Object $d_i$ denotes the sum of the distances between data point $p_i$ and all other data points in the same cluster. Object $d_{i,j}$ denotes the distance between data point $p_i$ and the medoid whose subscript is j. Object $\varepsilon_i$, which is unique in $\Pi_{kmbc}$, denotes data point $p_i$ or medoid $p_i$ (it is then called a data point mark or a medoid mark). Object $\varepsilon_{i,j}$ records the information that cluster i designates the data point whose subscript is j as its medoid. Object $t_{i,j}$ is a transmission mark recording the information that data point $p_i$ belongs to the same cluster as the medoid whose subscript is j. Other objects ensure the integrity of data information or act as flow controllers.
2) The initial membrane structure of $\Pi_{kmbc}$, with skin membrane A, is illustrated in Fig. 3, and it changes as $\Pi_{kmbc}$ evolves. As suggested by Fig. 3, membrane A is the skin, which contains all other membranes. The membranes of class $B_i$ serve as clusters: the data point marks distributed to them are placed there and new medoids are computed there. They are also the storage areas of the final results. Membrane C stores the distance matrix. The membranes of class $E_i$ compute the point-medoids distances set and decide where unassigned data points will belong. One membrane $E_i$ corresponds to one unassigned data point, and they have the same subscript. As $\Pi_{kmbc}$ evolves, membranes of class $D_{i,j}$ and $F_j$ are dynamically generated in membrane A and in the membranes of class $B_i$ respectively. The membranes of class $D_{i,j}$ compute the distance matrix and are dissolved after the computation. One membrane $D_{i,j}$ corresponds to one nonzero element of the distance matrix, and they have the same subscripts. The membranes of class $F_j$ compute the point-point distances set and its sum and are also dissolved after the computation. The delivery of objects between the other membranes takes place in membrane A, which also realizes the transfer of information and the flow of the computation.
Fig. 3. The initial membrane structure of $\Pi_{kmbc}$
$R_{mem}$ and $\rho_{mem}$ are described together as $R_{mem}$ for convenience; more details about $R_{mem}$ are given in subsection C of Section III.
In addition, a child membrane that is dynamically generated in a parent membrane inherits all the rules of the parent membrane. The rules of $\Pi_{kmbc}$ are designed so that inherited rules have no influence on child membranes. As a result, for conciseness, the rule sets of dynamically generated membranes list only the rules that are not inherited. The rules of $\Pi_{kmbc}$ and the calculation processes are elaborated in the next subsection.

C. The rules in $\Pi_{kmbc}$
For ease of understanding, the rule sets $R_{mem}$ in $\Pi_{kmbc}$ are explained in the order of the operation flow. Here $\rho_{mem}$, whose value is 1, 2 or 3, specifies a priority relation among the rules of $R_{mem}$, and a smaller value means a higher execution priority.

1) Initialization
a) Preparation
In the P system $\Pi_{kmbc}$, the rules in $R_A$ associated with the preparation work as follows. The rule of type $r_1$ starts the system. Rules of types $r_2$-$r_9$ create the membranes of class $D_{i,j}$, which compute $D_{nn}$ in preparation for the initialization, and move the data objects and flow controllers to these membranes. After rules of types $r_2$ and $r_4$ are applied, each membrane of type $D_{i,j}$ has been created and contains the objects whose subscript equals its first subscript. After rules of types $r_6$ and $r_7$ are applied, each membrane of type $D_{i,j}$ also contains the objects whose subscript equals its second subscript. The rule of type $r_8$ moves specific flow controllers to membrane $D_{i,j}$. The other rules assist in process control.

b) Compute and deposit $D_{nn}$
Rule set $R_{D_{i,j}}$ is associated with the computation of $D_{nn}$, and the subscripts i and j in $R_{D_{i,j}}$ equal the first and second subscripts of membrane $D_{i,j}$ respectively. Once the specific flow controller arrives, the rules in membrane $D_{i,j}$ start to be executed. Rules of types $r_4$-$r_9$ turn the differences between the two data points in the first and second dimensions into objects of types c and d. After rules of types $r_{12}$-$r_{17}$ have been applied a finite number of times, each membrane of type $D_{i,j}$ contains objects $d_{i,j}$ whose multiplicity is the square of the number of c plus the square of the number of d.
The rule of type $r_{21}$ dissolves the membrane and sends out a specific flow controller that signals the end of the computation.
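The net effect of a membrane $D_{i,j}$ on its multiset can be sketched in Python as follows. This shows only the arithmetic outcome, not the individual rules, which are not reproduced here. The object names (`"a1"`, `"b1"`, `"d1,2"`) and the function name are hypothetical, and the sketch assumes that coordinates are encoded as multiplicities of the objects for $a_i$ and $b_i$.

```python
from collections import Counter

def squared_distance_objects(objs, i, j):
    """Net effect of membrane D_{i,j} on a multiset (illustrative sketch).

    objs holds the unary-encoded coordinates: the multiplicity of "a<i>"
    is x_i and the multiplicity of "b<i>" is y_i.
    """
    c = abs(objs[f"a{i}"] - objs[f"a{j}"])   # number of objects c: |x_i - x_j|
    d = abs(objs[f"b{i}"] - objs[f"b{j}"])   # number of objects d: |y_i - y_j|
    # Multiplicity of d_{i,j}: squared Euclidean distance c^2 + d^2.
    return Counter({f"d{i},{j}": c * c + d * d})
```

For example, points (1, 1) and (4, 5) give 3 copies of c and 4 copies of d, hence 25 copies of the distance object.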
The rule of type $r_{10}$ in $R_A$, associated with depositing $D_{nn}$, moves objects of type $d_{i,j}$ to membrane C. Further rules in $R_A$ are associated with the initialization of the k medoids. After the rule of type $r_{11}$ is applied, each membrane $B_i$ contains the data point mark that has the same subscript as the membrane. The other rules assist in process control.

2) Initial assignment
Rules of types $r_{10}$-$r_{12}$ in $R_A$ move the objects needed for computing $D_i$ and the minimal item in $D_i$ to membrane C. After that, rules of types $r_1$-$r_6$ in $R_C$ copy these objects into the appropriate membranes of type $E_i$. Thus, each membrane of type $E_i$ contains k medoid marks and k objects of type $d_{i,j}$ or $d_{j,i}$, which are all the items of $D_i$. Rule set $R_{E_i}$ is associated with the computation of the minimal item in $D_i$, and the subscript i in $R_{E_i}$ equals the subscript of membrane $E_i$. When $p_i$ is a non-medoid data point, rules of types $r_4$ and $r_5$ in $R_{E_i}$ consume $d_j$ for all j present in the membrane until no $d_j$ is left for a certain j, and then turn the $\varepsilon_j$ corresponding to that j into $t_{i,j}$. Consequently, transmission mark $t_{i,j}$ means that the non-medoid data point $p_i$ belongs to the cluster with medoid $p_j$, and the rule of type $r_9$ moves $t_{i,j}$ out. Furthermore, rules of types $r_7$-$r_{11}$ in $R_{E_i}$ empty the membrane for the convenience of the next computation.
When $p_i$ is a medoid, nothing needs to be computed, but some objects are contained in membrane $E_i$ after the execution of rules of types $r_2$-$r_5$ in $R_C$. For the convenience of the next computation, rules of types $r_3$, $r_6$, $r_7$ and $r_{11}$ in $R_{E_i}$ are executed.
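The minimum search performed in a membrane $E_i$ for a non-medoid point can be sketched as a parallel countdown: in every step one copy of each distance object is consumed, and the medoid whose objects run out first is the closest one. The Python sketch below illustrates this idea only. The function name and the tie-breaking rule (smallest subscript) are assumptions of this example, since the text does not say how ties are resolved.

```python
def closest_medoid(dist_to_medoid):
    """Find the closest medoid by parallel consumption (illustrative sketch).

    dist_to_medoid: {j: multiplicity of the distance object for medoid j}
    """
    counts = dict(dist_to_medoid)
    # One loop pass models one parallel step: every d_j loses one copy.
    while all(v > 0 for v in counts.values()):
        counts = {j: v - 1 for j, v in counts.items()}
    # The first exhausted subscript identifies the closest medoid.
    # Assumption: ties are broken by the smallest subscript.
    return min(j for j, v in counts.items() if v == 0)
```

The number of steps equals the smallest distance, so the running time of this search depends on the magnitude of the distances rather than on the number of medoids.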
As transmission marks are now contained in membrane C, rules of types $r_7$-$r_{10}$ in $R_C$ move them out. The rule of type $r_{14}$ in $R_A$ sends a flow controller of type $\alpha$ to each membrane $B_i$. With the flow controllers present, k marks of type $\varepsilon_{i,j}$, which reflect that the i-th cluster currently has the medoid with subscript j, are copied into membrane A by the rule of type $r_1$ in $R_{B_i}$. Finally, membrane A contains objects of types $t_{i,j}$ and $\varepsilon_{i,j}$; that is, all the objects for the initial assignment are ready. Rules of types $r_{15}$-$r_{18}$ in $R_A$ are associated with the initial assignment and the preparation for the following step. The rule of type $r_{16}$ sends data point marks to the appropriate membranes of type $B_i$ in the presence of $t_{i,j}$ and $\varepsilon_{i,j}$. As a result, the initial assignment is achieved. Meanwhile, $D_{nn}$ and flow controller $\beta$ are sent to each membrane $B_i$ by rules of types $r_{16}$ and $r_{18}$.

3) Iterative assignment
a) Update medoids
As membranes of type $F_j$ are dynamically generated in each membrane $B_i$, the rules in $R_{B_i}$ associated with this are:
$r_2: \varepsilon_j \to (\varepsilon_j, here)(t_j)_{F_j}|_{\beta},\ 1,\ 1 \le j \le n$;
$r_3: \beta \to \lambda,\ 1$;
$r_4: d_{p,q} \to (x_{p,q}, in_{all})|_{\gamma},\ 1,\ 1 \le p \le n,\ 1 \le q \le n$;
$r_5: \gamma \to (\sigma, in_{all})(\eta, here),\ 1$;
$r_6: \eta \to \theta,\ 1$;
With the flow controller $\beta$ present, the rule of type $r_2$ generates the membranes $F_j$, each of which corresponds to a data point in the cluster. Rules of types $r_4$ and $r_5$ move $D_{nn}$ and a flow controller $\sigma$ to all the membranes $F_j$. Therefore, each membrane $F_j$ contains all the objects needed for computing the point-point distances set and its sum.
The subscript j in $R_{F_j}$ equals the subscript of membrane $F_j$. In each membrane $F_j$, rules of types $r_1$-$r_2$ turn the objects of type $x_{i,j}$ whose first or second subscript equals that of $F_j$ into $d_j$. Thus, the multiplicity of $d_j$ equals the sum of the point-point distances set of $p_j$. The rule of type $r_5$ dissolves the membrane and releases the objects of type $d_j$.
Rules of types $r_7$-$r_8$ in $R_{B_i}$ are executed repeatedly to find the object $\xi_j$ that points to the minimal $d_j$; that is, the data point whose subscript equals that of $\xi_j$ will be the new medoid. The remaining rules in $R_{B_i}$ are:
$r_9: d_j \to \lambda|_{\mu},\ 1,\ 1 \le j \le n$;
$r_{10}: \xi_j \to \varepsilon_j|_{\mu},\ 1,\ 1 \le j \le n$;
$r_{11}: \varepsilon_j \to \lambda|_{\mu e},\ 1,\ 1 \le j \le n$;
$r_{12}: e\mu \to \pi,\ 1$;
$r_{13}: \mu \to \pi,\ 2$;
$r_{14}: \pi\varepsilon_j \to (\varepsilon_j, here)(\pi\varepsilon_j e, out),\ 1,\ 1 \le j \le n$;
When the number of iterations left is not zero, rules of types $r_9$-$r_{13}$ are executed. Rules of types $r_{10}$-$r_{11}$ designate the new medoid and eliminate the other data points in the cluster. The rule of type $r_{12}$ reduces the number of iterations by one.
When the number of iterations left is zero, rules of types $r_9$-$r_{10}$ and $r_{13}$ are executed. The rule of type $r_{10}$ designates the new medoid. Each membrane $B_i$ still contains its data point marks when the system stops, because the rule of type $r_{11}$ is not executed and no iterations are left.
Whether or not the number of iterations left is zero, the rule of type $r_{14}$ is executed to copy the new medoid out and to send out an object e, which stands for one iteration.

b) Reassignment
The remaining rules in $R_A$ are:
$r_{23}: e^k \to e,\ 1$;
$r_{24}: e^z \to \omega,\ 2$;
$r_{25}: \omega \to (\omega, out),\ 1$;
When the number of iterations left is not zero, rules of types $r_{20}$-$r_{23}$ are executed. Rules of types $r_{20}$ and $r_{22}$ move the new medoid marks and the flow controller to membrane C. Reassignment then starts under the control of the flow controllers, and its flow is identical to that of the initial assignment.
When the number of iterations left is zero, the rule of type $r_{24}$ becomes applicable, and rules of types $r_{20}$-$r_{25}$ are executed. The rule of type $r_{24}$ creates $\omega$, which signals the end of the system. The rule of type $r_{25}$ sends $\omega$ out of membrane A to end the system, and the execution of rules of types $r_{20}$-$r_{22}$ does not affect the clustering results.

D. Complexity Analysis
In this subsection, the worst-case time cost of $\Pi_{kmbc}$ is analyzed according to the algorithm flow and the operation flow of the rules. It is assumed that executing a rule costs one time slice.

V. CONCLUSION
This paper proposes a P system $\Pi_{kmbc}$ based on k-medoids to achieve clustering in a shorter time. The algorithm for $\Pi_{kmbc}$ is modified to fit this cell-like P system with promoters and inhibitors. In this P system, a hierarchically arranged