Classification Model of Municipal Management in Local Governments of Peru based on K-means Clustering Algorithms

The K-means algorithm partitions a data set into a fixed number of clusters, iteratively assigning data points to the clusters and adjusting the center of each cluster. It is an unsupervised learning method for discovering patterns in an input data set. The purpose of this research is to propose a municipal management classification model for the municipalities of Peru using a K-means clustering algorithm based on 58 variables obtained from the areas of human resources, heavy machinery and operating vehicles, information and communication technologies, municipal planning, municipal finances, local economic development, social services, solid waste management, cultural, recreational and sports facilities, public security, disaster risk management, and environmental protection and conservation, covering all the municipalities of the 24 departments of Peru and the constitutional province of Callao. The results of applying the K-means algorithm show that 32% of the municipal governments, grouped by department, are in Cluster 1 (Amazonas, Apurímac, Huancavelica, Huánuco, Ica, Lambayeque, Loreto and San Martin); 8% are in Cluster 2 (Ancash and Cusco); 28% are in Cluster 3 (the constitutional province of Callao, Madre de Dios, Moquegua, Pasco, Tacna, Tumbes and Ucayali); and 32% are in Cluster 4 (Arequipa, Ayacucho, Cajamarca, Junín, La Libertad, Lima, Piura and Puno).
Keywords: K-means; cluster; municipality; model; municipal management


I. INTRODUCTION
The clustering problem is one of the most studied topics in the data mining and machine learning communities. It has a wide variety of applications in social networks, multimedia, social science, health, education and other fields of knowledge. Grouping is a diverse topic, and the underlying algorithms depend heavily on the data domain and on the scenario in which the problem occurs. The objective of clustering is to classify a set of elements into groups whose members are very similar to one another but different from the elements of other groups. The authors of [1] consider the K-means grouping algorithm to be one of the most efficient grouping algorithms for large-scale data sets.
The K-means algorithm clusters objects into k groups. Its results are therefore relevant both to researchers and to the municipalities of Peru, which promote local development while adapting to the real situation of present-day organizations. The main objective is to improve the provision of local public services and, consequently, to allow continuous improvement of the services offered.
This investigation develops a model based on the K-means algorithm. It proposes an adjustment of the grouping algorithm focused on detecting behavior patterns. To do so, it reads databases of Peruvian municipalities with 58 variables, which allows the decision-making process to improve at all levels.
Provincial and district municipalities are the governing bodies that promote local development, with legal status under public law and full capacity to fulfill their purposes (Law No. 27972, Organic Law of Municipalities). Local governments are classified according to their jurisdiction: the provincial municipality, whose jurisdiction covers the territory of the respective province and the district of the provincial capital; the district municipality, whose jurisdiction covers the territory of the district; and the town center municipality, whose jurisdiction is determined by the respective provincial council.
The National Institute of Statistics and Informatics (INEI) holds statistical information on provincial, district and population center municipalities at the national level, in order to generate municipal indicators that support regional and local management in planning and decision-making. This investigation uses the information on the 196 provincial municipalities, 1,678 district municipalities and 2,656 population center municipalities of the country compiled in the National Register of Municipalities 2019 [2]. The factors considered in the study are related to human resources, heavy machinery and operational vehicles, information and communication technologies, municipal planning, municipal finances, local economic development, social services, solid waste management, cultural, recreational and sports facilities, public security, disaster risk management, and environmental protection and conservation.
The purpose of the research is to propose a classification model for municipal management of local governments in Peru based on K-means clustering algorithms, while also evaluating the characteristics of the local governments in every cluster.
568 | Page www.ijacsa.thesai.org
This article is organized as follows. The second section reviews studies on the K-means algorithm and on the features of municipalities. The third section covers the theoretical background. The fourth section shows the results of the research, and the fifth discusses them. Finally, the sixth section presents the conclusions of the research and proposes some ideas for further investigation.

II. RELATED WORKS
This section presents the references of different investigations related to K-means and municipal management.
In [3], two grouping algorithms, the centroid-based K-means and Fuzzy C-means, are compared on their grouping efficiency. The study concludes that the K-means algorithm performs better than the Fuzzy C-means algorithm and notes that both can be used to discover association rules and functional dependencies.
Customer segmentation is the subdivision of a business customer database into groups called customer segments, so that each customer segment consists of clients who share similar market characteristics. In [4] the k-Means grouping algorithm has been applied in customer segmentation in a retail business, identifying four steady groups or customer segments.
The authors of [5] propose an improved K-means algorithm that raises classification precision in cases where K-means cannot adequately classify data under certain data distribution conditions. The proposal takes into account the effect of variance on the classification so that the data can be classified more accurately.
The authors of [6] present a data model capable of extracting, classifying and then mapping data in order to generate new, more structured data that meets the organization's needs. This arrangement is based on the K-means grouping algorithm.
Data recorded on the Internet is a form of Big Data to which the K-means technique can be applied to analyze user behavior. In [7], a grouping process is carried out using the K-means algorithm, which classifies users into three groups: high, medium and low. The results show that each of these groups frequently visits certain websites through search engines, social networks, and news and information sites.
In [8], a balanced K-means grouping algorithm is proposed to classify apples automatically. The results show that the precision of the multiple-characteristic classification method exceeds 96%. In [9], grouping is described as a data analysis technique used to investigate the underlying structure of the data by grouping objects that have similar characteristics.
In [10], a methodology is suggested to investigate the inherent patterns in the relationships between air traffic and macroeconomic development. To do so, it uses data mining techniques, including K-means grouping. The most important contribution of the methodology is its ability to select variables objectively and quickly.
In [11], four groups of municipalities are identified based on socioeconomic indicators. The purpose was to examine the socioeconomic differences among Slovenian municipalities and classify them into relatively homogeneous groups. The resulting groups reflect the development features of the municipalities, and the results confirm that the eastern part of Slovenia is less developed while the western part is the most developed. There is also a small group of municipalities where the socioeconomic situation is grave.
In [12], the purpose was to identify municipalities where aspirations for energy autonomy could make technical and economic sense, so that successful projects can be replicated in other municipalities within the same group. A cluster analysis is used to establish a municipal typology in order to analyze the techno-economic suitability of municipalities for autonomous energy systems. The results identify municipalities to which successful measures from other municipalities can be applied, and they provide a basis for future energy studies around the country.

III. THEORETICAL BACKGROUND
The K-Means technique is an unsupervised clustering algorithm, used with large amounts of data. The objective of the K-Means algorithm is to find "K" groups (clusters) among the data set. This algorithm is a grouping technique that is used in different machine learning applications [13] [14].
The K-means clustering algorithm [9] was proposed more than 50 years ago by Steinhaus (1956). Since then it has been applied in various fields of knowledge such as marketing, psychology, medicine, social sciences and biology, becoming one of the most widely used methods because of its simplicity, ease of implementation and efficiency [15].
The K-Means algorithm works iteratively, assigning each row of the input data set to one of the "K" groups based on its characteristics. Rows are grouped according to the similarity of their column (variable) values. The results of running the algorithm are:
• The centroid of each group: the coordinates of each of the K sets, which will be used to label new data sets.
• Labels for the training dataset. Each label corresponds to one of the K clusters formed.
The centroids move to a new position in each iteration of the process, until the algorithm converges. Once the centroids have been found, the data must be analyzed to observe which characteristics of each group are unique with respect to the other groups. These groups are the labels that the algorithm generates.
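As an illustration of how centroids can be used to label new data, the following minimal sketch assigns each new row to its closest centroid by Euclidean distance. The function name `nearest_centroid`, the centroid coordinates and the new rows are hypothetical values chosen for the example; they are not taken from the municipal data set.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda k: dist(point, centroids[k]))


# Hypothetical centroids of K = 2 clusters in a two-variable space.
centroids = [(0.0, 0.0), (10.0, 10.0)]

# Hypothetical new observations to be labeled.
new_rows = [(1.0, 2.0), (9.0, 8.5), (4.0, 4.0)]
labels = [nearest_centroid(row, centroids) for row in new_rows]
print(labels)  # [0, 1, 0]
```

The label of a new row is simply the index of the nearest centroid, so no re-training is needed to classify additional observations.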

A. K-Means Algorithm
The K-means algorithm begins by specifying a set of initial cluster centers derived from the data. It then assigns each record to the most similar cluster, based on the values of the input variables. After all cases have been assigned, the cluster centers are updated to reflect the new set of records assigned to each cluster. The records are then checked once more to see whether they need to be reassigned to a different group, and the allocation and update process continues until the maximum number of iterations is reached or the change between one iteration and the next does not exceed a specified threshold. Depending on their similarity or dissimilarity, data are grouped into several different groups, so that data within the same group are similar and data in different groups are different [16].
The K-means algorithm is one of the most important unsupervised clustering algorithms [17] and produces high quality results with low computation time [1]. The K-means machine learning algorithm [18] is used to group a data set that is known, assumed, or indicated in advance. In [5], K-means is described as a classic prototype-based partitioning technique that attempts to group data into k groups specified by the user.
2) Initialize the k centroids of the group.
3) Assign the n data points to the closest clusters.
4) Update the centroids of each group.
5) Repeat steps 3 and 4 until there are no more changes in the positions of the centroids.
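The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name `k_means`, the random initialization from data points, the parameters `max_iter`, `tol` and `seed`, the handling of empty clusters, and the sample data are all hypothetical.

```python
import random
from math import dist


def k_means(data, k, max_iter=100, tol=1e-9, seed=0):
    """Cluster `data` (a list of equal-length numeric tuples) into k groups."""
    rng = random.Random(seed)
    # Step 2: initialize the k centroids with k randomly chosen data points.
    centroids = rng.sample(data, k)
    labels = [0] * len(data)
    for _ in range(max_iter):
        # Step 3: assign every point to its closest centroid.
        labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in data]
        # Step 4: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:
                mean = tuple(sum(c) / len(members) for c in zip(*members))
                new_centroids.append(mean)
            else:
                # Assumption: keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[j])
        # Step 5: stop when the centroids no longer move (within `tol`).
        shift = max(dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, labels


# Hypothetical data: two well-separated groups of two-variable observations.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(data, k=2)
print(labels)
print(centroids)
```

The loop stops either when `max_iter` is reached or when the largest centroid displacement falls to `tol` or below, which corresponds to the two stopping conditions described earlier in this section. Which numeric label each group receives depends on the random initialization.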
The k-means algorithm is one of the most widely used grouping algorithms [19]. It is designed to group numerical data, and each group has a center called the centroid. The algorithm is intended for cases in which all the variables are of the quantitative type and the squared Euclidean distance [20]
$$d(x_i, x_{i'}) = \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = \lVert x_i - x_{i'} \rVert^2$$
is chosen as the dissimilarity measure, where $p$ is the number of variables. Weighted Euclidean distance can be used by redefining the values $x_{ij}$.
The within-cluster point scatter can be written as:
$$W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} \lVert x_i - x_{i'} \rVert^2$$
where $C(i)$ denotes the cluster to which observation $i$ is assigned. In the k-means algorithm, the sum of the squares of the Euclidean distances of the data points to their closest representatives is used to quantify the objective function of the clustering [20]. Therefore, we have:
$$W(C) = \sum_{k=1}^{K} N_k \sum_{C(i)=k} \lVert x_i - \bar{x}_k \rVert^2$$
where $\bar{x}_k = (\bar{x}_{1k}, \ldots, \bar{x}_{pk})$ is the vector of means associated with the k-th cluster, and $N_k = \sum_{i=1}^{N} I(C(i) = k)$ is the number of observations assigned to it. Thus, the criterion is to assign the N observations to the K clusters so that, within each cluster, the average dissimilarity of the observations from the cluster mean, as defined by the points in that cluster, is minimal.
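To make the clustering criterion concrete, the sketch below computes the within-cluster scatter in two ways: as half the sum of squared Euclidean distances over all pairs of points in the same cluster, and as the sum over clusters of the cluster size times the squared distances of its points to the cluster mean. The two forms agree for any partition. The function names and the five sample points are hypothetical and serve only to illustrate the arithmetic.

```python
from itertools import product


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def scatter_pairwise(clusters):
    """Half the sum of squared distances over all ordered pairs in each cluster."""
    return 0.5 * sum(
        sq_dist(a, b) for pts in clusters for a, b in product(pts, repeat=2)
    )


def scatter_centroid(clusters):
    """Sum over clusters of N_k times the squared distances to the cluster mean."""
    total = 0.0
    for pts in clusters:
        n_k = len(pts)
        mean = tuple(sum(c) / n_k for c in zip(*pts))
        total += n_k * sum(sq_dist(p, mean) for p in pts)
    return total


# Hypothetical partition of five two-variable observations into K = 2 clusters.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0), (5.0, 7.0), (8.0, 6.0)]]
print(scatter_pairwise(clusters))  # 28.0
print(scatter_centroid(clusters))  # 28.0
```

The centroid form is the one minimized in practice, since it requires only the distance of each point to its own cluster mean rather than all pairwise distances.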

IV. RESULTS
This section shows the classification of municipal management in the local governments of Peru after applying the K-means clustering algorithm.

A. Research Variables
A total of 58 variables were used, obtained from the areas of human resources, heavy machinery and operational vehicles, information and communication technologies, municipal planning, municipal finances, local economic development, social services, solid waste management, cultural, recreational and sports facilities, public security, disaster risk management, and protection and conservation of the environment, for all local governments in the 24 departments of Peru and the constitutional province of Callao. The variables were the following:
Fig. 1 shows the K-means algorithm used to establish the classification of municipal management in the local governments of Peru. Fig. 2 shows the classification of local governments by department. According to the K-means algorithm, 32% (8) of the local governments, grouped by department, are in cluster 1, 8% (2) in cluster 2, 28% (7) in cluster 3 and another 32% (8) in cluster 4.
According to Table V, of the total municipal income collected in cluster 1, 46% comes from Central Government transfers, 42% from financing, 12% from current income and 0.03% from capital income. In cluster 2, 67% corresponds to transfers from the Central Government, 25% to financing, 8% to current income and 0.07% to capital income. In cluster 3, 51% comes from transfers from the Central Government, 29% from financing, 20% from current income and 0.42% from capital income. In cluster 4, 50% corresponds to transfers, 36% to financing, 14% to current income and 0.12% to capital income.
On the other hand, capital expenses account for 63% of the expenses executed by the municipalities in cluster 1, 67% in cluster 2, 42% in cluster 3 and 59% in cluster 4.
Table VIII shows that in cluster 1, 67 municipalities collect solid waste, with an average of 411,894 kilos per department. In cluster 2, 137 municipalities collect solid waste, with an average of 742,400 kilos per department. In cluster 3, 18 municipalities collect solid waste, with an average of 297,729 kilos per department. In cluster 4, 104 municipalities collect solid waste, with an average of 816,506 kilos per department.
Table IX shows that in cluster 1 the municipal library served 23,075 users on average, cultural centers 2,741 users, theaters 2,471 users and museums 1,795 users. In cluster 2, the municipal library served 103,883 users, cultural centers 19,630 users, theaters 79,457 users and museums 45,623 users. In cluster 3, the municipal library served 5,881 users, cultural centers 2,472 users, theaters 7,649 users and museums 4,768 users. In cluster 4, the municipal library served 73,529 users, the House of Culture 13,133 users, theaters 13,917 users and museums 14,930 users.
Fig. 4 shows the classification of municipal management in the local governments of Peru in the 24 departments and the constitutional province of Callao, resulting from the application of the K-means clustering algorithm. Of the local governments of the departments of Peru, 32% are in Group 1, 8% in Group 2, 28% in Group 3 and 32% in Group 4. In [11], the focus was on a classification into four groups of municipalities, because it offers a more detailed view of the socioeconomic differences between Slovenian municipalities, which also makes the results more informative. The variables used were socioeconomic development indicators. The K-Means algorithm is used to improve the results of Ward's method.
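The cluster shares reported above can be checked with a short calculation. The sketch below takes the department lists given in the abstract, counts the departments in each cluster and converts the counts into percentages of the 25 units (24 departments plus the constitutional province of Callao). The dictionary name `clusters` and the variable names are hypothetical.

```python
# Departments per cluster, as listed in the abstract.
clusters = {
    1: ["Amazonas", "Apurímac", "Huancavelica", "Huánuco",
        "Ica", "Lambayeque", "Loreto", "San Martin"],
    2: ["Ancash", "Cusco"],
    3: ["Callao", "Madre de Dios", "Moquegua", "Pasco",
        "Tacna", "Tumbes", "Ucayali"],
    4: ["Arequipa", "Ayacucho", "Cajamarca", "Junín",
        "La Libertad", "Lima", "Piura", "Puno"],
}

# 24 departments plus the constitutional province of Callao.
total = sum(len(departments) for departments in clusters.values())

# Share of each cluster as a rounded percentage of all units.
shares = {c: round(100 * len(d) / total) for c, d in clusters.items()}
print(total)   # 25
print(shares)  # {1: 32, 2: 8, 3: 28, 4: 32}
```

The counts 8, 2, 7 and 8 out of 25 reproduce the reported shares of 32%, 8%, 28% and 32%.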

VI. CONCLUSIONS
This work uses the K-means algorithm to form a classification model for municipal management of local governments in Peru based on their indicators. The results are presented in four groups, which give a more detailed and informative picture of municipal management. The factors considered are related to human resources, heavy machinery and operational vehicles, information and communication technologies, municipal planning, municipal finances, local economic development, social services, solid waste management, cultural, recreational and sports facilities, public security, disaster risk management, and environmental protection and conservation. The application of the K-means algorithm is a contribution to the municipalities, helping them improve municipal management by making sound decisions for the benefit of their inhabitants.

VII. FUTURE WORK
The research efforts in this work have focused on a specific question: the application of K-means algorithms to the management of municipalities. The analysis of municipal management by regions is reserved for future work.