Remote Sensing Satellite Image Clustering by Means of Messy Genetic Algorithm

A Messy Genetic Algorithm (GA) is applied to satellite image clustering. Because a schema can be expressed with a variable-length code, Messy GA can maintain long schemata, so that more suitable clusters can be found than with the existing Simple GA clustering. Results with simulation data show that the proposed Messy GA based clustering achieves about four times better cluster separability than Simple GA, while results with Landsat TM data of Saga show almost 65% better clustering performance.
Keywords: Genetic Algorithm (GA); Messy GA; Simple GA; clustering


I. INTRODUCTION
Quantification theory and clustering are typical statistical methods for unsupervised image classification [1]. Clustering methods can be broadly classified into hierarchical clustering, which treats pixels as individuals and classifies sets of individuals hierarchically, and non-hierarchical clustering, which divides a set of individuals into a given number of clusters at once [2]. In the latter, an initial set of clusters is given, the cluster to which each individual belongs is determined from the distance between the individual and each cluster, the cluster centroids are recomputed, and the individuals are reallocated. The former differs in that clusters are formed from the distances between individuals, between individuals and clusters, and between clusters, without an initial set of clusters being given [4]. In general, the latter is used more frequently because it converges faster. Among such relocation methods, the k-means method [5], which relocates individuals among k clusters, and ISODATA (Iterative Self-Organizing Data Analysis Technique A) [3] are well known.
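The relocation procedure described above (give initial clusters, assign each individual to the nearest cluster, recompute the centroids, repeat) can be sketched as follows. This is a minimal illustration of the k-means idea only; the function names, the random choice of initial centres, and the toy two-band pixel values are assumptions made for the example and are not taken from the experiments in this paper.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def k_means(pixels, k, max_iter=100, seed=0):
    """Relocation clustering: assign, recompute centroids, repeat until stable."""
    rng = random.Random(seed)
    centers = rng.sample(pixels, k)  # assumed initialisation: k pixels at random
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                      for p in pixels]
        if new_labels == labels:
            break  # no pixel changed cluster, so the division is stable
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(pixels, labels) if lab == c]
            if members:  # keep the old centre if a cluster becomes empty
                centers[c] = centroid(members)
    return labels, centers

# Toy data: six two-band pixels forming two well-separated groups.
pixels = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centers = k_means(pixels, k=2)
```

The result depends on the initial centres, which is the weakness that motivates the stochastic search discussed next.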
In any clustering method, when n individuals are divided into k clusters, there is no guarantee that an optimal division will be obtained. Non-hierarchical clustering can also be regarded as a kind of combinatorial optimization problem, for which the Genetic Algorithm (GA) is one effective method [6]. Clustering by GA, which exploits stochastic search and learning, improves the division step by step, given an evaluation criterion and an initial division of the clusters. The conventional method is clustering by Simple-GA [7].
However, in Simple-GA the locus of a gene coincides with its position on the chromosome, so a schema is likely to be favored or disfavored depending on where it lies on the chromosome, and long schemata are likely to be destroyed. In Messy-GA, on the other hand, the chromosome is a variable-length list in which each gene is expressed as a codon, that is, a (locus, allele) pair. Therefore, a schema is unlikely to be favored or disfavored because of the correspondence between locus and position on the chromosome, and long schemata can be expected to be preserved [7]. When a genetic algorithm is applied to image clustering, cluster numbers are first assigned to the pixel array by random numbers; arrays of cluster numbers (schemata) that are effective for maximizing the fitness function are then preserved while crossover and mutation are applied, so that the optimal clustering is searched for stochastically. However, an image inherently has high spatial correlation, so the cluster to which a target pixel belongs is likely to match the clusters of the surrounding pixels.
The author previously proposed a clustering method that takes such contextual information into account [8]. This paper further proposes a method using Messy-GA to ensure that schemata are preserved. The author takes the degree of separation between clusters as the measure of clustering accuracy, evaluates it using simulation data and real satellite images (Landsat TM), and confirms the effectiveness of the proposed method.

II. PROPOSED METHOD
The author proposes clustering using Messy-GA in comparison with Simple-GA.

A. Messy GA Clustering
The schema length of Simple GA is fixed. Therefore, a relatively long schema that is effective for crossover tends to be broken, and it is consequently difficult to find the most appropriate chromosome. In Messy GA, by contrast, crossover is much more effective: because the chromosome is represented as a list structure of variable length, all possible chromosomes up to the maximum length can be prepared. The procedure is as follows.
(1) Coding of the chromosome; the initial pairs of pixel number and cluster number are set.
(2) Fitness function evaluation.
(3) Initialization.
(4) Primordial phase.
(5) Juxtaposition phase.
When the iteration number and the data number exceed the threshold, all the pixels are assigned to cluster numbers.

B. Chromosome-Genotype Expression
Gene expression representing the state in which n individuals are divided into k clusters is performed as shown in Table I.
That is, a pixel is defined as an individual, and its cluster number is defined as a gene. Genes are arranged in the order of the pixel arrangement, and GA is used as an algorithm that stochastically searches for the optimal cluster of each pixel. The optimal clustering is the one that survives under the condition of maximizing the fitness function described below.
The best clustering is found by exchanging, at a certain probability, a partial gene sequence (schema) of one chromosome with a partial sequence of another, or by causing a mutation at a certain probability. In Simple-GA, the schema description has a fixed length, so even if a schema that is valid for maximizing the fitness function is found, it is highly likely to be destroyed. In Messy-GA, the schema description is variable in length, so a schema judged to be valid can be retained. This mechanism is shown in Fig. 1.
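To make the variable-length representation concrete, the following sketch decodes a messy chromosome, written as a list of (pixel index, cluster number) pairs, into one cluster number per pixel. It assumes the conventions of the original messy GA literature: when a locus appears more than once, the leftmost pair wins, and loci that do not appear are filled from a fixed-length template. Whether the present method resolves such cases in exactly this way is an assumption, and the function and variable names are hypothetical.

```python
def decode_messy(chromosome, template):
    """Expand a variable-length messy chromosome into one cluster number per pixel.

    chromosome: list of (locus, allele) pairs, i.e. (pixel index, cluster number).
    template:   fixed-length list giving a cluster number for unspecified pixels
                (assumed convention, borrowed from the messy GA literature).
    """
    labels = list(template)
    seen = set()
    for locus, allele in chromosome:
        if locus in seen:
            continue  # overspecified locus: the leftmost pair takes precedence
        seen.add(locus)
        labels[locus] = allele
    return labels

template = [0, 0, 0, 0, 0, 0]
messy = [(4, 1), (1, 2), (4, 0), (2, 1)]  # locus 4 appears twice; loci 0, 3, 5 are missing
print(decode_messy(messy, template))      # [0, 2, 1, 0, 1, 0]
```

Because a pair carries its own locus, a useful group of pairs can sit anywhere in the list, which is why its survival does not depend on position on the chromosome.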

C. Fitness Function
The chromosome representing the state in which n individuals are divided into k clusters is coded as a list of locus-allele pairs, ((xi1, vi1), ..., (xin, vin)), where xi1 ... xin indicate loci and vi1 ... vin indicate the allele values at those loci. Whereas the chromosome length in Simple-GA is fixed at l, Messy-GA allows a variable length.
A fitness function is defined by equation (3),
where ST is the total sum of squares of the n individuals, SW(k) is the sum of the within-cluster sums of squares of the k clusters, and SB(k) is the sum of the between-cluster sums of squares of the k clusters. The n individuals are divided into k non-empty, mutually exclusive clusters.
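The three quantities named above satisfy the standard decomposition ST = SW(k) + SB(k), which the following sketch computes for a given assignment of pixels to clusters. Equation (3) itself is not reproduced in this text, so the ratio SB(k) / ST used in `assumed_fitness` is only an assumed stand-in for the fitness function, chosen because it grows as the clusters separate; all function names and the toy data are hypothetical.

```python
def mean_vector(points):
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def sum_of_squares(points, center):
    """Sum of squared Euclidean distances from each point to a centre."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in points)

def scatter_terms(pixels, labels):
    """Return (S_T, S_W, S_B) for a given assignment of pixels to clusters."""
    grand_mean = mean_vector(pixels)
    s_t = sum_of_squares(pixels, grand_mean)
    s_w = 0.0
    s_b = 0.0
    for c in set(labels):
        members = [p for p, lab in zip(pixels, labels) if lab == c]
        center = mean_vector(members)
        s_w += sum_of_squares(members, center)                       # within-cluster part
        s_b += len(members) * sum_of_squares([center], grand_mean)   # between-cluster part
    return s_t, s_w, s_b

def assumed_fitness(pixels, labels):
    """ASSUMPTION: fitness = S_B / S_T; equation (3) of the paper may differ."""
    s_t, _, s_b = scatter_terms(pixels, labels)
    return s_b / s_t if s_t > 0 else 0.0

# Toy data: four two-band pixels; `good` groups them by position, `bad` does not.
pixels = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = [0, 0, 1, 1]
bad = [0, 1, 0, 1]
s_t, s_w, s_b = scatter_terms(pixels, good)
```

For this toy data the sensible division gives ST = 104, SW = 4 and SB = 100, while the poor division gives SW = 100 and SB = 4, so any fitness that rewards between-cluster scatter prefers the first.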

D. Selection Operation
The same number of chromosomes as in the population of the previous generation are selected from that population according to fitness, using the expected-value strategy with uniform random numbers together with the elite preservation strategy. Elimination of the chromosomes that are not selected is performed at the same time. As an example, the decrement of the expected value in the expected-value strategy is 0.75.
Gene (Cluster_No.): C0 C1 … Cn-1
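One possible reading of this selection step is sketched below. Each chromosome starts with an expected number of copies equal to its fitness divided by the mean fitness, the fittest chromosome is always kept (elite preservation), and every time a chromosome is picked its expectation is lowered by the decrement (0.75 in the example above). The weighting of the random draw by the remaining expectation, the fallback branch, and all names are assumptions for illustration; fitness values are assumed to be positive.

```python
import random

def expected_value_selection(population, fitness, decrement=0.75, seed=0):
    """Select len(population) chromosomes; the fittest one is always kept."""
    rng = random.Random(seed)
    n = len(population)
    mean_fitness = sum(fitness) / n
    # Expected number of copies of each chromosome in the next generation.
    remaining = [f / mean_fitness for f in fitness]

    best = max(range(n), key=lambda i: fitness[i])
    chosen = [best]                      # elite preservation
    remaining[best] -= decrement

    while len(chosen) < n:
        candidates = [i for i in range(n) if remaining[i] > 0]
        if candidates:
            weights = [remaining[i] for i in candidates]
            pick = rng.choices(candidates, weights=weights)[0]
        else:
            pick = rng.randrange(n)      # fallback if every expectation is used up
        chosen.append(pick)
        remaining[pick] -= decrement     # each selection lowers the expectation
    return [population[i] for i in chosen]

population = ["A", "B", "C", "D"]
fitness = [4.0, 2.0, 1.0, 1.0]
selected = expected_value_selection(population, fitness)
```

Lowering the expectation after each pick limits how many copies a single strong chromosome can contribute, which keeps some diversity in the next generation.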

E. Crossover
The crossover operation performs multipoint crossover modeled on dominant inheritance. According to the crossover probability, the chromosomes to be crossed are chosen from those selected in the selection operation, and crossover is performed in the order in which they were chosen.
At the time of crossover, the genotypes of the two chromosomes do not match. Therefore, a reference locus is chosen from the chromosomes to be crossed using uniform random numbers. Based on the chosen reference locus, alleles are replaced according to equation (7) so that the genotypes match.
In this way, multipoint crossover is performed on chromosomes whose allele types have been unified. However, an allele replacement is actually carried out only when it improves the fitness of the chromosome, and the fitness of that chromosome is updated every time a replacement is made. As an example, the crossover probability is 0.6.

F. Mutation Operation
According to the mutation probability, the chromosomal locus at which a mutation occurs is determined randomly, and a new allele for that locus is drawn using uniform random numbers. The allele is replaced only if the replacement improves the fitness. As an example, the mutation probability is 0.03.
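The accept-only-if-better rule described above can be sketched as follows. Treating the mutation probability as a per-locus probability is one interpretation of the text, and the toy fitness (the negative within-cluster sum of squares of one-dimensional values) is a stand-in chosen only to keep the example self-contained; it is not the fitness function of this paper. All names are hypothetical.

```python
import random

def greedy_mutation(labels, k, fitness, p_mut=0.03, seed=0):
    """Mutate cluster numbers, keeping a change only when the fitness improves."""
    rng = random.Random(seed)
    labels = list(labels)
    current = fitness(labels)
    for locus in range(len(labels)):
        if rng.random() >= p_mut:
            continue                         # this locus is not mutated
        candidate = list(labels)
        candidate[locus] = rng.randrange(k)  # new allele from a uniform random number
        score = fitness(candidate)
        if score > current:                  # replace only if the fitness improves
            labels, current = candidate, score
    return labels, current

values = [0.0, 0.1, 0.2, 9.8, 9.9, 10.0]    # toy one-band pixel values

def toy_fitness(labels):
    """Stand-in fitness: negative within-cluster sum of squares."""
    total = 0.0
    for c in set(labels):
        members = [v for v, lab in zip(values, labels) if lab == c]
        mean = sum(members) / len(members)
        total += sum((v - mean) ** 2 for v in members)
    return -total

start = [0, 1, 0, 1, 0, 1]
# p_mut is raised to 1.0 here only so that every locus is tried in this tiny demo.
mutated, score = greedy_mutation(start, k=2, fitness=toy_fitness, p_mut=1.0)
```

Because rejected mutations are discarded, the fitness of a chromosome never decreases under this operator, at the cost of one extra fitness evaluation per attempted mutation.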

G. Convergence Condition
The initial division of the clusters is taken as generation 0, and reaching a preset number of generations is used as the termination condition of the program. As an example, the final generation is set to 300,000.

III. EXPERIMENT
The conventional method and the proposed method were applied to simulation data and to actual satellite images, and their clustering accuracy was evaluated. To show the superiority of the proposed method under equal conditions, the GA parameters of the conventional method are the same as those of the proposed method.

A. Simulation Parameters
GA parameters are as follows. The simulation data creation parameters are as follows. Cluster individuals were generated by a modified version of the Neyman-Scott method [9].

• Distance between clusters: 4σ
A population of 100 corresponds to an image of 10 × 10 pixels. 900 sets of simulation image data were generated by changing the initial value (seed) of the random numbers. The distance between clusters i and j is given by equation (8),
where Di,j is the covariance matrix of clusters i and j, |Di,j| is its determinant, and n is the number of dimensions.
Normal random numbers were generated sequentially by giving the average and standard deviation, and constrained by the distance between clusters to construct a pixel array. Fig. 2 shows an example of the generated simulation image data. From the left, bands 1 and 2 of cluster 1 and bands 1 and 2 of cluster 2 are shown.
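A simplified version of such data generation is sketched below: two clusters of normally distributed two-band pixel values whose means are 4σ apart, arranged as a 10 × 10 pixel array. This is not the modified Neyman-Scott procedure used in the experiments. Placing the second cluster mean 4σ away along the first band, assigning pixels to clusters at random, and measuring the separation as a plain distance between means (rather than by equation (8)) are all simplifying assumptions, and the function name is hypothetical.

```python
import random

def make_simulation_image(width=10, height=10, bands=2, sigma=1.0,
                          separation=4.0, seed=0):
    """Return (pixels, truth): band values per pixel and the generating cluster."""
    rng = random.Random(seed)
    # Cluster 0 is centred at the origin; cluster 1 is shifted by
    # separation * sigma along the first band (assumed geometry).
    means = [[0.0] * bands,
             [separation * sigma] + [0.0] * (bands - 1)]
    pixels, truth = [], []
    for _ in range(width * height):
        cluster = rng.randrange(2)           # assumed: clusters drawn at random
        truth.append(cluster)
        pixels.append([rng.gauss(m, sigma) for m in means[cluster]])
    return pixels, truth

# Changing the seed yields a different image, as in the 900 trials described above.
pixels, truth = make_simulation_image(seed=1)
```

With a fixed seed the same image is reproduced, so the two clustering methods can be compared on identical data.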

B. Simulation Results
Enough simulation images were generated for the results to be treated statistically (900 here). Clustering was performed by Simple-GA (SGA) and Messy-GA (MGA), and the degree of separation between clusters given by equation (8) was evaluated. The cluster result images of Simple GA and Messy GA clustering are shown in Figs. 3 and 4, respectively. Only two of the 900 trials are shown here. Table II shows the number of generations up to convergence and the final degree of separation between clusters. Fig. 5 shows an example of the change in the degree of separation between clusters (learning process).
Here, TBfitness and generation indicate the degree of separation between clusters and the number of generations to convergence, respectively. In the figure, the broken line represents the learning process of SGA, and the solid line represents the learning process of MGA. Because the chromosomes characteristic of MGA are of variable length and are composed of a list of locus-allele pairs, all genotypes with building blocks of the deceptive order length can be generated at the initialization stage, and only genotypes with valid building blocks are passed on to later generations. In addition, gene lists can be exchanged on the chromosome, and this optimization learning takes time. In these respects MGA differs significantly from SGA, whose chromosomes are represented with a fixed length.
The Landsat TM (Thematic Mapper) image of the area around Saga city acquired on March 25, 1986 was used as the actual satellite image. Fig. 6 shows the location of the intensive study area (red circle) on the Google map.
From the image, a portion of 32 × 32 pixels is extracted as shown in Fig. 7. Landsat TM has seven spectral bands, including a thermal band. The ground sampling interval (pixel size) is 30 m for the reflective bands and 120 m for the thermal band. Data from six bands (excluding the thermal band) are used for clustering. Five clusters (urban, road, soil, water, paddy) are assumed, so the number of clusters is set to five.
A true color image of the portion of the Landsat TM image acquired on March 25, 1986 is shown in Fig. 8. Fig. 9(a) shows only band 1 of the images used. The images resulting from SGA and MGA clustering are shown in Fig. 9(b) and (c), respectively.
The clustered image of SGA is noisy, while that of MGA is relatively smooth. In particular, although the Ariake Sea area should be clustered as a single cluster, the SGA result assigns it not only to water body but also to bare soil, rice paddy, and so on. The MGA result, in contrast, shows comparatively reasonable clusters. Table III shows the number of generations to convergence and the degree of separation between clusters.
From these experimental results, it is found that Messy GA is superior to the conventional Simple GA in terms of both the reasonableness of the clustered result and the separability between clusters, although the time required for the clustering process is much longer than for Simple GA.

IV. CONCLUSION
A Messy Genetic Algorithm (GA) is applied to satellite image clustering. Because a schema can be expressed with a variable-length code, Messy GA can maintain long schemata, so that more suitable clusters can be found than with the existing Simple GA clustering.
In Simple-GA, a chromosome has a fixed-length list structure, so a long schema is likely to be destroyed. In contrast, Messy-GA has a variable-length list structure and can retain a long schema. When the accuracy of the two clustering methods was compared using 900 sets of simulation data, the separation between clusters was about four times higher with Messy-GA, and the result using the Landsat TM image showed an improvement of about 65%, indicating that Messy-GA clustering is superior to Simple-GA clustering. However, it was also found that the number of generations to convergence was about 10 times higher for Messy-GA than for Simple-GA.
The author also confirmed that both Simple-GA and Messy-GA surpassed the accuracy of k-means clustering, and observed how the accuracy improvement tends to vary with the difficulty of clustering; these results will be reported on another occasion.

V. FUTURE RESEARCH WORKS
Further experiments are required to validate the effectiveness of Messy-GA clustering with other remote sensing satellite images. The applicability of the proposed Messy-GA clustering should also be examined not only for remote sensing satellite images but also for other kinds of images.