Cost Effective Hybrid Fault Tolerant Scheduling Model for Cloud Computing Environment

Cloud computing provides flexible and cost effective way for end users to access data from multiplatform environment. Despite of support by the features of cloud computing, there are also chances of resource failure. Hence there is a need of a fault tolerant mechanism to achieve undisrupted performance of cloud services. The task reallocation and duplication are the two commonly used fault tolerant mechanisms. But task replication method results in huge storage and computational overhead, when the number of tasks is increasing gradually. If the number of faults is high, it incurs more storage overhead and time complexity based on task criticality. In order to solve these issues, we propose to develop a Cost Effective Hybrid Fault Tolerant Scheduling (CEHFTS) Model for cloud computing. In this model, the Failure Occurrence Probability (FoP) of each VM is estimated by finding the previous failures and successful executions. Then an adaptive fault recovery timer is maintained during a fault, which is adjusted based on the type of faults. Experimental results have shown that CEHFTS model achieves 43% reduced storage cost and 13% reduced response delay for critical tasks, when compared to existing technique. Keywords—Cloud computing; failures; fault tolerant; critical tasks; scheduling; fault recovery; overhead


I. INTRODUCTION
Cloud computing is one of the new areas of technology that has emerged recently. On-demand resource sharing can be accomplished by utilizing internet and cloud services [1]. The primary goal is to offer services or distribute resources at the request of the customer with the lowest possible cost. There is a huge amount of heterogeneous resources [2], a vast user base, and a great range of application tasks in the cloud computing system. They have to deal with numerous individual activities, as well as big data. In terms of finance and science, cloud computing is a very valuable service [3]. A difficult challenge arises when deadlines are imposed on the jobs, since the required resources may be unavailable due to failures [4].
Along with missing project deadlines, fault tolerance is an important cloud computing aspect. It is designed to handle real-time tasks even if a failure occurs. Due to the higher storage overhead of taking backups, it is vital to use alternative approaches that have a high resource utilization ratio. Even though cloud features are attractive and a fault tolerant technique is required to guarantee uninterrupted cloud service functioning [5].
Both internal and external factors contribute to some of the flaws. The two most often utilized fault-tolerant strategies in cloud computing are task reallocation and duplication. In task reallocation, a task is reassigned after a fault occurs. This method boosts the overall system resource usage. However, failing to meet the deadline constraints [6] of jobs could cause the response time to be prolonged. The rise in the amount of electricity consumed while making resource allocations causes resource requests to fail in data centers.
Both traditional application components and new traditional application components may be heterogeneous in their component composition, composed of scientific techniques. With virtualization technology, these components have their own specialized execution environment. Workflows can also be implemented and run in cloud environments that offer an ever-growing resource pool [7] that you pay just for what you use. On-demand cloud resources can thus be acquired and shared on workflows. Clouds serve as a favored scientific process execution environment due to these characteristics.
However, processes that operate in the clouds face obstacles, because the complexity and dynamics of processes and cloud breakdowns occur more frequently. Failures tend to stop continuous implementation and have considerable impacts on the performance of processes, particularly for big, long-run workflows. It is crucial that scientific workflow scheduling fault tolerants be achieved in order to provide consumers with flawless experience.
Due to the time-consciousness of many scientific procedures, their successful completion does not depend exclusively upon the correct results but also on the time when those results are available. For a workflow that must complete the working phase and generate the computer results, a deadline is specified for the workflow. Failures lead to deadline conscious workflows that cannot be concluded on time without an appropriate faults-tolerant schedule. In that circumstance, QoS is badly damaged, although beyond the deadline [8] the results may be acquired. Therefore, it is required and crucial to plan a defect tolerant workflow since processes can be successfully completed before the deadline. 416 | P a g e www.ijacsa.thesai.org This paper is organized as follows. Section 2 presents the motivation and objectives. Section 3 presents the proposed methodology. Section 4 presents the simulation results. Section 5 presents the conclusion.

II. MOTIVATION AND OBJECTIVES
In fault-tolerant workflow scheduling [6], task replication method is used. Here duplicate multiple copies of tasks [9] can execute simultaneously. But it results in huge storage and computational overhead, when the number of tasks is growing. Moreover, it did not present the strategy for finding the exact number of duplicate copies.
The Power, memory, and other network elements were addressed in order to enhance the reliability of resources. In task scheduling, the VM with the greatest level of reliability is picked. Hussein El Ghor et al. [8] has presented two scheduling algorithms. The first approach uses energyefficient fault-free scheduling and the second approach provides essential slack time for fault recovery. The proposed model is limited in handling small group of tasks that cannot improve the performance levels.
Redundancy is a common way to protect against errors that can be done quickly or in space. Any additional resources required to perform the identical duplication or resilience replication process are designated as spatial redundancy. When considered in depth, an example of spatial redundancy may be parallel execution. There are considerable space expenses when you simultaneously run many jobs on different resources. In the case of redundancy, the task that was missed with respect to initial resources does not have to be redone because the clock was reset.
With Fault Tolerance Algorithm utilizing Selective Mirrored Tasks Approach (FAUSIT), they proposed a selective mirrored task method to ensure the correct balance between parallelism and topology for applications. DAGbased application fault tolerance is dealt with by limiting the number of make spans and minimizing computation costs.
For important jobs or tasks with permanent failures, the dynamic fault-tolerant workflow scheduling [10] uses spatial re-execution (SRE) and temporal re-execution (TRE) schemes in unison. However, as the number of issues detected for important operations is greater, the SRE strategy consumes more disc space. If the number of faults for non-critical tasks is excessive, then it has an adverse effect on time complexity. Additionally, the backups' odds of failure are not examined.
In FESTAL [11], for each primary task, a backup task is created and VMs are allocated for both. When the primary task is completed successfully within the execution time, then the execution of back task is terminated. But in this approach also, it results in huge storage and computational overhead [12], when the number of tasks becomes high. The performance levels are not satisfactory as the model causes overhead in the system.
In [13], a fault tolerant technique to ensure task completion deadline, has been proposed. But it does not check the types of tasks and storage overhead also is more that improves the delay levels in the system.
In [15], a trust based scheduling technique is proposed. For selecting the trusted computation service, set-based particle swarm optimization (S-PSO) and for selecting the storage service, ser covering problem (SCP) tree search are applied. The task scheduling concept in [16] is built on trust mechanisms. This model uses the Bayesian cognitive technique to determine the trustworthiness of computer nodes by examining the trust relationships that exist between them.
A research paper [17] outlines a trust-based Meta heuristic workflow scheduling technique called TMWS. The manual offers strategies and procedures to avoid the hazards associated with workflow scheduling.
Hence the fault tolerant algorithm should meet the following objectives: • To reduce storage overhead.
• To reduce time complexity.
• To reduce the chances of backup VM failures.
• To meet deadline of critical tasks.
• To reduce the response delay.
• To reduce the fault recovery time.
• To increase the CPU utilization.
In order to solve these issues, we propose to develop a cost effective hybrid fault tolerant scheduling scheme for cloud computing.

A. Overview
In this paper, Cost-effective Hybrid Fault Tolerant Scheduling (CHFTS) model is proposed. Figure 1 shows the block diagram of the CHFTS model.
The CHFTS model uses previous failures and successful executions to calculate the VM failure occurrence probability (FoP). The anticipated execution time (EET) is estimated for each job Ti. At this point, the projects with deadlines of less than one week are classed as high-priority activities. The designation of some VMs as fault-tolerant based on FoP. A fixed number of primary and backup VMs are assigned to each task. A recovery timer is initiated if a failure occurs in any VM. A notification will be delivered to the failed VM, and the VM will continue execution from the point in time that the failure occurred.

B. System Model
Virtualization divides each server in the Cloud system into a series of heterogeneous virtual machines (VMs). The fact is that a VM is an essential component of a cloud system.
Suppose that a cloud system offers a set of VM resources, as given by.  The model represents the system structure, as seen in Figure 2. Processing capacity and cost per hour are particularly important for selecting VM(k) because of processing capacity and cost per hour, respectively (k). Users can access an endless number of VMs on cloud computing platforms that utilize virtualization. In addition, all VMs are hosted in one cloud data center to provide consistent bandwidth for all the VMs.

C. Fault Model
If a task is incomplete because of internal or external issues, it may be considered as a task failure.
Faults may occur during the execution of scientific work flows in cloud computing environments. There are various reasons behind these faults. The most common reason for task failure is the failure of corresponding VM. The other reasons for faults include unavailability of enough resources, resource overloading, delayed execution time etc.
There are two cases of faults that may occur in cloud, during execution. The first one is a permanent fault and other one is a temporary fault. In temporary faults, the faults may be recovered within short span of time whereas in permanent faults, the faults cannot be rectified until the failed element is repaired or replaced. Fail-signal or acceptance test is a fault detection mechanism, commonly used. Due to task failures, the make span time of work flows will be increased and the service-level agreement (SLA) will be violated. Re-execution is one of the commonly used less expensive fault tolerant technique, which improves the reliability of workflows. Re-execution can be implemented in two methods: Spatial re-execution on other resources (SRE) and Temporal re-execution on the same resources, after fault recovery (TRE).

1) Estimation of FoP:
The exponential probability density function for a specified mean time (M) between failures, is given by.
Then, the probability of a failures, that occurs before time Δ for a VM is given by.
Hence the probability of successful completion during the time Δ, is given by.
Let be the computation time for a subtask executing on a VM and be the computation interval between two check points.
Then, the average number of attempts ( A) required to finish is given by.
The number of success is given by Hence the number of failures NoF, during Δ is given by Let S(t i ) and W(t i ) be the workload size of given input task t i .
If and only if all the input data has been received from all prior tasks, then the task is started (ti). start time of t i is signified by Eq. (9).
Note that if t i =t enrty , then T start (t i )= 0. Then, the transmission time is computed by, where BW is the bandwidth between two VMs.
Then, the execution time is given by, Thus, the end time of task t i is given by.
Hence the make span of workflow is given by. For each task, a pair of VMs <VM p , VM b > from the k VMs, are allocated, where VM p and VM b are primary and backup VMs. If there is a fault occurs in VM p , then a fault recovery timer (T FR ) will be started.
Case-1: If the task is a critical task, then T FR is fixed such that.
T FR < EET-FOT, where FOT is the time of fault occurrence.
Case-2: If the task is a non-critical task, then T FR is fixed such that.
If the fault cannot be recovered within T FR , (ie) If fault recovery time (FRT) is more than T FR , then immediately, a notification will be sent to VM b , which will resume the execution from that point.

Algorithm -Fault Tolerant Scheduling
Input : List of VMs and Tasks Output: fault tolerant scheduling 1. foreach VMj 2.
End if 6. End for 7. foreach task t i 8.

end For
Special outcomes of this approach 1) Since fault free VMs are selected, the chances of faults are less.
2) It does not result in huge storage since the VMs are allocated only when the fault is not recoverable (permanent fault) within a time span.
3) The time critical tasks will be executed within their deadlines since the tasks are migrated to backup VM, within the deadline.

4)
The non critical tasks can be executed in the same VMs if the fault is recovered within tolerable time span (which is an even multiple of its deadline). 5) If the task could not be recovered beyond the tolerable time span, then only the non critical task are migrated to backup VMs, thereby reducing the storage overhead significantly.
6) Since the execution of tasks starts in the backup VM only from the time of fault, the time complexity will be less.

IV. EXPERIMENTAL RESULTS
The proposed CEHFTS model is compared with the Spatial-Temporal Re-Execution method (SRE) [10]. The NASA workload [14] has been used as the emulator of Web users requests to the Access Point (AP). This workload represents realistic load deviations over a period time. It comprises 100960 user requests sent to the Web servers during a day. Table 1 shows the experimental parameters assigned in this work.

A. Results for Critical Tasks
In this section, work flows with strict deadlines are considered (critical tasks) and allocated to VMs. The FoP of each task is increased in terms of fault rate. The performance metrics response delay, percentage of missed deadlines, CPU utilization and storage overhead of server are considered for evaluation. Table 2 Performance Results of CEHFTS and SRE schemes for fault rate of critical tasks. Figure 3 shows the response delay measured for the critical tasks when the fault rate is increased from 1 to 2. Since in the case of time bound critical tasks, backup VM migration is mostly supported, the response delay will be high. Hence, CEHFTS has the delay in the range of 1.4 to 2.8 seconds whereas SRE has delay in the range of 1.7 to 3.1 seconds. However, in case of CEHFTS, the failed tasks are resumed when the recovery time is less than EET. Hence it has 13% reduced response delay, when compared to SRE scheme.  Figure 4 shows the missed deadlines for the critical tasks when the fault rate is increased from 1 to 2. Since in the case of time bound critical tasks, if the recovery time is higher than EET or if the fault is permanent fault, the chances of deadline miss are high. As depicted from the figure, CEHFTS has the missed deadlines in the range of 18% to 41% whereas SRE has missed deadlines in the range of 25% to 47%. However, in case of CEHFTS, the failed tasks are resumed when the recovery time is less than EET. Hence it has 19% reduced missed deadlines, when compared to SRE scheme.    Figure 5 shows the CPU utilization for the critical tasks when the fault rate is increased from 1 to 2. In case of permanent failures, the CPU utilization of primary VMs will be affected for the critical tasks. As it can be seen, from the figure, the utilization of CEHFTS reduces from 57% to 41% and the utilization of SRE reduces from 47% to 33%. However, CEHFTS has 18% higher CPU utilization than SRE scheme, since it minimizes the use of backup VMs.  Figure 6 shows the storage cost of servers in case of critical tasks. Since in case of time bound critical tasks, if the recovery time is higher than EET or if the fault is permanent fault, the tasks are migrated to backup VMs, leading to increased storage cost. As depicted from the figure, CEHFTS has cost in the range of 150 to 470 whereas SRE has cost in the range of 270 to 840. However, in case of CEHFTS, the failed critical tasks are resumed when the recovery time is less than EET. Hence it has 43% reduced storage cost, when compared to SRE scheme.

B. Results for Non-Critical Tasks
In this section, work flows with elastic (flexible) deadlines are considered (critical tasks) and allocated to VMs. The FoP of each task is increased in terms of fault rate. The performance metrics response delay, CPU utilization and storage overhead of server are considered for evaluation.   Figure 7 shows the response delay measured for the noncritical tasks when the fault rate is increased from 1 to 2. Since in case of non-critical tasks, the fault response timer is set as high, the response delay will be higher than that of critical tasks. Hence, CEHFTS has the delay in the range of 2.6 to 6.4 seconds whereas SRE has delay in the range of 3.5 to 8.3 seconds. However, in case of CEHFTS, the failed tasks are resumed when the recovery time is less than EET. Hence it has 25% reduced response delay, when compared to SRE scheme. Figure 8 shows the CPU utilization for the non-critical tasks when the fault rate is increased from 1 to 2. In case of permanent failures, the CPU utilization of primary VMs will be affected for the non-critical tasks also. As it can be seen, from the figure, the utilization of CEHFTS reduces from 52% to 40% and the utilization of SRE reduces from 45% to 30%. However, CEHFTS has 16% higher CPU utilization than SRE scheme, since it minimizes the use of backup VMs. Figure 9 shows the storage cost of servers in case of noncritical tasks. Since, if the fault is permanent fault, the tasks are migrated to backup VMs, leading to increased storage overhead. As depicted from the figure, CEHFTS has cost in the range of 120 to 270 whereas SRE has cost in the range of 165 to 472. However, in case of CEHFTS, the failed noncritical tasks are resumed when the recovery time is less than EET. Hence it has 35% reduced storage cost, when compared to SRE scheme.

V. CONCLUSION
In this research, a new hybrid fault tolerant scheduling approach, called Cost Effective Hybrid Fault Tolerant Scheduling (CEHFTS), is proposed for cloud computing. For this model, the prior failures and successful executions are used to estimate the VM failure probability (FoP). Estimated anticipated execution time (EET) is calculated for each task. At this point, the projects with deadlines of less than one week are classed as high-priority activities. An adaptive fault recovery timer is kept running until a fault occurs, which is then increased as a function of the various fault types. Experiments are conducted for critical and non-critical tasks by varying the fault rate. Experimental results have shown that CEHFTS model achieves 43% reduced storage cost and 13% response delay for critical tasks, when compared to existing technique. Future work focus on grouping the tasks based on its type and requirements so that the CPU utilization and battery power can be further reduced. The task deadline and time scheduling can still be improved to enhance the performance levels.