Implementation of a Beowulf Cluster and Analysis of its Performance in Applications with Parallel Programming

The Image Processing Research Laboratory (INTI-Lab) of the Universidad de Ciencias y Humanidades obtained permission to use the embedded systems laboratory. INTI-Lab researchers will use this laboratory for research related to large-scale video processing, climate prediction, climate change research, and physical simulations, among other topics. These projects involve highly complex processes that, when run on ordinary computers, take an unfavorable amount of time for the researcher. For this reason, a high-performance cluster architecture was implemented: a set of computers interconnected on a local network that behaves as a single system to solve complex problems using parallel computing techniques. The intention is to reduce execution time roughly in proportion to the number of machines, resembling a low-cost supercomputer. Performance tests were carried out scaling from 1 to 28 computers to measure the reduction in time. The results show whether it is feasible to use the architecture in future projects that demand processes of high scientific complexity.

Keywords—High-performance cluster; distributed programming; computational parallelism; Beowulf cluster; high-efficiency computing


I. INTRODUCTION
With the growth of technological advances related to the world of computing, new techniques are emerging that take full advantage of computers interconnected on the same local network. The idea is to meet specific needs in less time than an ordinary computer would require. Carrying out processes with high availability, efficiency, and performance has become critical to providing better services, reducing the time needed to solve complex problems, and guaranteeing continuous availability.
This work implements a high-performance Beowulf-type cluster, a set of low-cost computers interconnected by a network to solve problems of high scientific complexity in less time [1]. The work was motivated by proposals for new projects whose results take hours or days to obtain, depending on the complexity of the algorithm. Thanks to this work, results can be delivered in less time using distributed programming.
The work was carried out in the embedded systems laboratory of the Universidad de Ciencias y Humanidades for scientific purposes, benefiting INTI-Lab by enabling simulations of high scientific complexity. These types of architectures are commercially considered supercomputers because they are designed to increase computing power by allowing parallel processing of tasks and high-speed communication [2]. Twenty-eight computers are used, on which a performance test is run using a specialized algorithm. The idea is to measure scalability and determine the exact number of computers to use, given the bottlenecks that can occur in the communication network.
The computers used in the architecture remain primarily for student use; for that reason, there is a schedule for their scientific use that avoids inconveniences when the laboratory is needed for classes. Reusing hardware resources for this kind of implementation is not a new idea. The Polytechnic University of Altamira [3] uses the five computers of its optimization and networks laboratory to solve complex problems with a specialized package in order to measure scalability. Recycled computers should not be wasted, and they can be put to use in a cluster architecture, as in the case of the Universidad Nacional de Ingeniería [4], which uses recycled computers to obtain the maximum computational benefit. All the architectures mentioned are called Beowulf clusters, so named because of their low or modest computing resources [5]. They rely on basic hardware elements: the RAM, the central processing unit (CPU), and the network card.
In this work, the maximum potential of the computers of the embedded systems laboratory is exploited, so that the result of this work is, in effect, a supercomputer built with modest computing resources.

II. METHODOLOGY
The cluster architecture was implemented in Laboratory 302-B using 29 computers interconnected through a switch. One computer is in charge of distributing the problem to the 28 remaining computers, which solve it by applying computational parallelism techniques. The hardware is low cost; for that reason, the architecture is called a Beowulf cluster, as detailed later in this work. As can be seen in Fig. 1, the lower-left part specifies the operating system and the tool used to achieve parallelism between the nodes. Because the computers are also used for academic work, all of them are connected directly to the public network. Regarding security, the orchestrator computer generates a unique key with the Secure Shell protocol, and this key is copied to each laboratory computer so that no external agent without access permissions can connect.
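For reference, the key generation and distribution described above can be done with standard OpenSSH commands similar to the following; the user name and host names are hypothetical placeholders rather than the actual laboratory accounts:

# On the master node: generate the key pair used by the cluster
ssh-keygen -t rsa
# Copy the public key to each slave node so the master can connect without a password
ssh-copy-id usuario@nodo01
ssh-copy-id usuario@nodo02
...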

A. Cluster Architecture
The cluster architecture can be of three types: high-performance cluster (HPC), high-availability cluster (HA), and high-efficiency cluster (HP) [6]. In this work, an HPC architecture was implemented, which uses powerful computing tools and processes to generate data for advanced academic research [7]. This type of architecture was chosen to take maximum advantage of the computing resources and obtain successful results in less time. As more project proposals appear, other types of architecture may be needed, turning the future laboratory of the institution into a hybrid cluster. The characteristics of the laboratory HPC are as follows.

1) Master node: It is the computer in charge of distributing the problem, applying parallel programming to divide it among the slave nodes in order to obtain a result in less time. From this node, the ecosystem of the architecture can be observed by installing monitoring packages.
2) Slave node: These computers have two functions: to take the portion of the general problem distributed by the master node, and to return the partial result when processing finishes. They do not need a graphical user interface, since each one basically needs only a network connection so that its cores can be used for the process.
3) Communication network: It is the medium that enables communication between the slave nodes and the master. The better the network equipment, the lower the network congestion, making the performance of the processes more favorable.
4) Secure Shell protocol: In this work, Secure Shell (SSH) is used. Thanks to this protocol, the master node can interact securely with the slave nodes. For information security reasons, a unique key was generated on the master node and copied to the slave nodes to obtain a secure cluster architecture.

5) Parallelization tools: Making the computers work in parallel requires specialized tools; in this case, Open MPI was used, which is an open-source implementation of the Message Passing Interface. This tool supports the computational parallelism techniques used here; a minimal illustrative example is given below. Flynn's taxonomy, taken from [8], describes two relevant types of parallelism:
a) MISD: Applies the multiple-instruction, single-data technique, where the functional units perform different operations on the same data.
b) MIMD: Applies the multiple-instruction, multiple-data technique to achieve parallelism; machines that use it have several processors that operate asynchronously and independently.
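As a minimal illustration of this MIMD style with Open MPI (an example written for this explanation, not the benchmark code used later; the workload size is arbitrary), each process identifies itself by its rank and works independently on its own slice of the data:

// Minimal Open MPI example: each process (rank) works asynchronously and
// independently on its own portion of the data (MIMD style).
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);               // start the MPI environment

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); // identifier of this process
    MPI_Comm_size(MPI_COMM_WORLD, &size); // total number of processes

    const int total_work = 1000;          // arbitrary amount of work
    int chunk = total_work / size;        // portion assigned to each rank
    int begin = rank * chunk;
    int end   = (rank == size - 1) ? total_work : begin + chunk;

    std::printf("Rank %d of %d handles items [%d, %d)\n",
                rank, size, begin, end);

    MPI_Finalize();                       // shut down the MPI environment
    return 0;
}

Compiled with mpic++ and launched with mpirun, one copy of this program runs on every requested core, and each copy follows its own execution path over different data.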

B. Architectural Status Monitoring
One of the most common problems when using a Beowulf cluster is that some of the computers may become disconnected. Accessing each one to verify its connectivity can be tedious, even more so with 29 computers. To address this problem, an algorithm was developed in the Python programming language that reports the connectivity state of each node; Fig. 2 shows the pseudocode.
As a result, the report in Fig. 3 is obtained, showing the 28 slave nodes and the master node, all active on the network.

C. Beowulf Cluster
The implementation is of the Beowulf type, so named because it uses low-cost hardware components that behave as if they were a single computer [9]. The computers of the embedded systems laboratory are used, a space where the students of the university carry out their academic activities; therefore, the laboratory has a specific schedule for research use. The architecture uses 28 slave computers and one master computer, with a total of 196 cores among the slave nodes, which process the problem by applying computational parallelism techniques, distributing the work to each of the cores to obtain results in less time. Fig. 4 shows the computers used in the performance testing process.
All equipment has the same hardware characteristics, as shown in Table I.

D. Performance Testing
In this work, performance tests were carried out to measure scalability with a computation-intensive algorithm for the sum of prime numbers, implemented in the C++ programming language.

1) Parallel algorithm for calculating prime numbers: For the performance tests, an algorithm used in previous work with virtual machines was selected [10]. It applies parallelism techniques through Open MPI, performing intensive iterations over each number to validate whether it is prime. In this work, an intensive calculation is carried out by testing 2, 4, and 8 million iterations; Fig. 5 shows the pseudocode, and an approximate sketch of the idea is given below.
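The exact procedure is the one given in Fig. 5; the following sketch only approximates the idea in C++ with Open MPI (the limit value, the strided division of candidates, and the simple trial-division test are assumptions made for illustration): each process checks its own subset of numbers, and the partial counts are combined on the master node with MPI_Reduce.

// Approximate sketch of the distributed prime calculation (see Fig. 5 for
// the actual pseudocode used in the tests).
#include <mpi.h>
#include <cstdio>

// Trial-division primality test; intentionally unoptimized, since the goal
// of the benchmark is intensive computation.
static bool is_prime(long n) {
    if (n < 2) return false;
    for (long d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long limit = 2000000;           // e.g. 2, 4 or 8 million iterations
    long local_count = 0;

    // Each rank tests every size-th candidate starting from its own rank.
    for (long n = 2 + rank; n < limit; n += size)
        if (is_prime(n)) ++local_count;

    // Combine the partial counts on the master node (rank 0).
    long total = 0;
    MPI_Reduce(&local_count, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("Primes found below %ld: %ld\n", limit, total);

    MPI_Finalize();
    return 0;
}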

III. RESULTS
In this section, performance tests using the prime number calculation algorithm with distributed programming are presented. Scalability measurements are made using from one to 28 slave nodes of the Beowulf cluster.
The algorithm is executed from the console of the master node with the line: mpirun -np # -hostfile ../.mpi_hostfile ./primos, where the symbol # represents the number of processes to launch (one per core), .mpi_hostfile lists the slave nodes, and ./primos is the compiled algorithm. The more slave nodes are used, the more cores must be specified from the Linux console so that the results are delivered in less time.
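For illustration, an Open MPI hostfile lists one machine per line together with the number of slots (cores) it offers. The host names below are hypothetical, and seven slots per slave node are assumed, consistent with the 196 cores reported for the 28 slave nodes:

# ../.mpi_hostfile (illustrative content; host names are hypothetical)
nodo01 slots=7
nodo02 slots=7
...
nodo28 slots=7

With such a file, the full cluster is requested with, for example, mpirun -np 196 -hostfile ../.mpi_hostfile ./primos.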
In Table II, the first tests are performed with 2 million iterations using from one to 28 slave nodes. In Fig. 6, the same result is shown in the form of a statistical graph, taking into consideration the number of slave nodes versus the time of the result.
In Table III, the tests are performed with 4 million iterations using from one to 28 slave nodes. In Fig. 7, the same result is shown in the form of a statistical graph, taking into consideration the number of slave nodes versus the time of the result.
In Table IV, the tests are performed with 8 million iterations using from one to 28 slave nodes. In Fig. 8, the same result is shown in the form of a statistical graph, taking into consideration the number of slave nodes versus the time of the result.
Finally, Fig. 9 shows a statistical graph of the number of slave nodes versus the result time, taking as reference the three performance tests used in the previous tables.

IV. DISCUSSION AND CONCLUSIONS
The work serves as a starting point for running algorithms of high scientific complexity. A continuous improvement plan has been scheduled, in which activities will be carried out according to the needs that arise from the research direction of the Universidad de Ciencias y Humanidades. Improvements will include high availability for handling large volumes of information using Big Data techniques, as in [11], a work similar to ours in its measurement of scalability when using more nodes and in demonstrating the efficiency of HPC architectures with Big Data Hadoop. Concerning the results section, a reduction in time is observed as more slave nodes are used; this is due to their cores carrying out the work in parallel. This work can be improved by using a larger number of cores without resorting to new slave nodes, as in the case of [12], a work of the Universidad Nacional de Ingeniería that takes full advantage of the GPU of each computer and demonstrates performance five times higher than using the CPU alone.
In future work related to increasing the potential of the Beowulf cluster, the inclusion of graphics cards will be proposed. These graphics cards will make the architecture more powerful through the use of PyCuda.
This work demonstrates that the use of a Beowulf cluster architecture built with the embedded systems laboratory computers reduces processing time without the need to acquire specialized equipment. As shown in Fig. 9 of the results section, the more complex the problem, the more efficient the use of the slave nodes becomes, so it can be concluded that the implementation of this architecture meets the proposed objectives.