Enhancing Elasticity of SaaS Applications using Queuing Theory

Elasticity is one of the key features of cloud computing. It allows providers of Software as a Service (SaaS) applications to reduce the cost of running their applications. In large SaaS applications that are developed using a service-oriented architecture model, each service is deployed in a separate virtual machine and may use one or more other services to complete its task. Although scaling a service independently of the services it requires propagates the scaling problem to those services, most current elasticity approaches do not consider functional dependencies between services, which increases the probability of violating the service level agreement. In this paper, the architecture of a SaaS application is modeled as a multi-class M/M/m processor sharing queuing model with deadlines, so that functional dependencies between services are taken into account when estimating the resources required for scaling. Experimental results show the effectiveness of the proposed model in estimating the required resources when scaling virtual resources.


I. INTRODUCTION
In the last few years, Software as a Service (SaaS) has spread rapidly in many areas. SaaS is a software delivery model in which software is delivered to customers as a service [1]. Instead of delivering an individual application instance to each tenant, one application instance serves thousands of tenants [2]. Nowadays, several SaaS companies, such as Salesforce.com, NetSuite, and SuccessFactors, utilize the elasticity feature of cloud computing to ensure the lowest cost of service delivery. However, developing a multi-tenant SaaS application that serves thousands of tenants, each with thousands of users, is a very hard and expensive task because of the large number of factors that have to be considered during the development phases, such as customizability, security, scalability, and pricing.
Most current SaaS applications have been developed using the service-oriented architecture (SOA) model [1]. In the SOA model, each application is a collection of services that are organized in several layers. Each service uses services in the lower layer to complete its tasks. In large SaaS applications, each service is deployed in a separate virtual machine. Although a basic assumption is that scaling any service has to be reflected in all the services it requires, most current research does not consider functional dependencies between services and scales them separately. As a consequence, scaling problems are shifted from one layer to the next. Unfortunately, the problem is not only specifying functional dependencies between services but also specifying the number of virtual machines that have to be added or removed.
For example, suppose we have three services X, Y, and Z. Service X uses services Y and Z to complete its tasks. Service X receives three types of requests: A, B, and C. Service X uses service Y to complete requests of type A, uses service Z to complete requests of type B, and uses both service Y and service Z to complete requests of type C. If service X is detected as overloaded, scaling service X independently of Y and Z moves the overloading problem to Y, Z, or both of them. However, which service has to be scaled, and what is the optimal number of VM instances that have to be added to or removed from each service? This depends on the types of arriving requests. If the overload is caused by a high number of requests of type A, then adding more VMs to service Z will waste resources and reduce revenue. Collecting such information without modeling functional dependencies is a very hard task.
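The example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `DEPENDENCIES`, `downstream_load`, and `services_to_scale` are hypothetical, as are the arrival rates and the spare capacities, and each request is assumed to generate exactly one call to every lower service it needs.

```python
# Hypothetical dependency map taken from the X/Y/Z example: the lower
# services that service X calls for each request type.
DEPENDENCIES = {"A": {"Y"}, "B": {"Z"}, "C": {"Y", "Z"}}


def downstream_load(arrivals):
    """Return the request rate that X forwards to each lower service.

    `arrivals` maps a request type to its arrival rate at X. One call per
    required service per request is an assumption of this sketch.
    """
    load = {}
    for request_type, rate in arrivals.items():
        for service in DEPENDENCIES[request_type]:
            load[service] = load.get(service, 0.0) + rate
    return load


def services_to_scale(load, spare_capacity):
    """Return the lower services whose forwarded load exceeds their spare capacity."""
    return sorted(
        service
        for service, extra in load.items()
        if extra > spare_capacity.get(service, 0.0)
    )


# A surge dominated by type A requests (illustrative numbers only).
surge = downstream_load({"A": 120.0, "B": 5.0, "C": 10.0})
targets = services_to_scale(surge, {"Y": 50.0, "Z": 50.0})
```

With these illustrative numbers only service Y needs extra VMs, which matches the observation that adding VMs to service Z would waste resources when type A requests cause the overload.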
Thus, this paper models SaaS applications as a multi-class M/M/m processor sharing queuing model with deadlines, in order to consider functional dependencies and request types when estimating the resources required for scaling. The proposed model reflects scaling actions on many metrics, such as CPU utilization, response time, and throughput, which are commonly used by most current auto-scaling techniques to trigger auto-scaling actions. Therefore, SaaS application providers can apply the proposed model with any auto-scaling technique to take functional dependencies between services into account.
Queuing network models have been extensively applied in many areas and have proven their efficiency in representing and analyzing resource-sharing systems such as computer systems [3]. According to Kendall's notation, the first M in the M/M/m queuing model represents the arrival process, which is a Markov arrival process. It has been theoretically proved that if a large number of customers make independent decisions about when to request service, the resulting arrival process will be a Markov arrival process [4]. The second M represents the service process, which is a Markov service process. The third symbol, m, represents the number of parallel servers that provide one service. Servers receive requests from different classes and serve them according to the processor sharing discipline.
www.ijacsa.thesai.org
The effectiveness of the proposed model has been evaluated by comparing the performance of an auto-scaling algorithm with and without the proposed model. Simulation results show that the proposed model reduces violations of the Service Level Agreement and increases revenue.
The rest of this paper is organized as follows. Section 2 describes the related work. Section 3 briefly describes the proposed model. Section 4 experimentally demonstrates the effectiveness of the proposed model. Finally, Section 5 concludes.

II. RELATED WORK
Although several auto-scaling approaches have been proposed in the last few years [6,7,8,9,10], most current auto-scaling approaches do not consider functional dependencies between an application's services. Current auto-scaling approaches can be categorized into two main categories: reactive and proactive. Reactive auto-scaling approaches scale computational resources based on rules and according to metrics such as memory utilization, CPU utilization, throughput, and response time [15,16,17,18]. However, the relations between the metrics of related services are not modeled. Therefore, the impact of scaling a service is unknown until it occurs.
On the other hand, proactive auto-scaling approaches trigger auto-scaling operations based on the predicted workload. Different time series techniques, such as Support Vector Machine, Exponential Smoothing, and Neural Networks, have been used to predict the future workload [13,14,17,19,20]. Although functional dependencies between an application's services are very effective factors in predicting the future workload, most current proactive techniques do not consider them. This section gives an overview of some current approaches.
Biswas et al. [5,21] have proposed a framework that provides a virtual private cloud for a single client enterprise. A proactive auto-scaling technique has been proposed to provision and release resources from a public cloud according to the predicted system load. Support vector machine and linear regression have been employed to predict the future load. In [6], Biswas et al. have proposed a reactive auto-scaling algorithm that serves incoming requests while considering their service level agreements. The proposed algorithm scales resources based on the profit that is gained from serving incoming requests and on the cost benefit to the user.
Sellami et al. [7,8] have proposed a threshold-based auto-scaling approach that offers dynamic service instances for multi-tenant business processes. The proposed approach considers functional dependencies between each multi-tenant process and its services when deciding on a scaling action. The proposed approach has been encapsulated in a middleware layer between the software and platform layers.
Xiao et al. [9] have modeled the automatic scaling problem as a Class Constrained Bin Packing problem, where each server is a bin and each class represents an application. To scale provisioned resources, a semi-online color set algorithm has been proposed. However, they have encapsulated each application instance inside a single virtual machine (VM), which is not applicable to large applications.
Ahn et al. [10] have proposed an auto-scaling method that supports execution deadlines. The proposed method can handle Bag-of-Tasks jobs and workflow jobs. Jobs in a Bag-of-Tasks can be scheduled separately from each other, while jobs in a workflow have to be scheduled in the order of their dependencies. The proposed method has been evaluated using CloudSim, and the evaluation shows that the proposed auto-scaling method increases resource utilization.
Chaloemwat et al. [11] have tried to enhance the performance of threshold-based auto-scaling techniques by using the Skewness algorithm and VM migration. The effectiveness of the proposed enhancement has been shown by comparing the performance of threshold-based auto-scaling techniques with and without the proposed enhancement.
Srirama et al. [12] have proposed a resource provisioning policy that takes into account the lifetime, periodic cost, and configuration cost of each instance type to find the optimal combination of possible instance types. The auto-scaling problem is represented as a linear programming model. The solution of this linear programming model provides the optimal number of VM instances of each instance type that must be added or removed to serve the workload at minimum cost. Unfortunately, the linear programming model can provide solutions only for a small number of VMs and cannot deal with large systems.
Hirashima et al. [13] have proposed a threshold-based auto-scaling mechanism that proactively adjusts resources to serve the incoming workload based on the predicted workload. The Autoregressive Integrated Moving Average (ARIMA) model has been exploited to forecast the future workload. Moreover, the proposed mechanism reactively adapts virtual resources if an unpredicted workload arrives. However, the performance of the proposed mechanism has not been evaluated with unpredictable workloads.
Khatua et al. [14] have proposed a threshold-based auto-scaling algorithm that adapts virtual resources proactively according to the predicted workload. The proposed algorithm predicts the workload by using the ARIMA model.
Nikravesh et al. [22] have proposed an auto-scaling system that predicts the workload using two time-series prediction algorithms: Support Vector Machine (SVM) and Neural Networks (NN). The proposed system automatically switches between SVM and NN based on the patterns of the workload. SVM is used with periodic workload patterns, while NN is used with unpredicted workload patterns. Although functional dependency is an important factor in predicting the workload, it has not been considered when predicting the future workload.
Liao et al. [23] have proposed a dynamic threshold-based auto-scaling strategy for Amazon Web Services. The proposed strategy adapts thresholds according to the demand for resources. The upper threshold is set in the range 50%-75% and the lower threshold is set in the range 5%-30%. The upper and lower thresholds are adapted proportionally with the expansion of VMs.
Tang et al. [24] have proposed a reinforcement learning based auto-scaling algorithm. The workload is categorized into normal workload (daily busy-and-idle workload) and burst workload. The auto-scaling problem is modeled as a Markov Decision Process (MDP), and Reinforcement Learning is applied to decide when to scale up or down and how many VM instances to add or remove.
Chen et al. [25] have proposed a hybrid auto-scaling mechanism. The proposed mechanism predicts the next CPU usage rate from historical data by applying several time series techniques, such as the Autoregressive-Moving-Average model, the Autoregressive model, the Exponential Smoothing model, the Moving Average model, and the Naïve model. The proposed mechanism also scales resources reactively to minimize the effects of wrong workload predictions.

III. SAAS APPLICATION MODEL
This paper deals with SaaS applications that cannot be encapsulated in one VM and are developed using the Service-Oriented Architecture model. Each service is deployed in a separate VM instance and can be scaled up or down by adding or removing VM instances. Each VM has a fixed processing capacity, which is divided equally among all tasks (Processor Sharing (PS)). Thus, each task's service time depends on the total number of tasks that exist at the same time. No task can run simultaneously on more than one VM. Therefore, if the number of tasks is less than the number of VMs for a specific service, each task is processed by a single VM and the remaining VMs are idle. If the number of tasks is greater than the number of VMs, tasks are processed according to the processor sharing discipline. In this paper, the term "web service" is used to refer to a service component in a SaaS application.
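The sharing rule described above can be illustrated with a small Python sketch. It is one possible reading of the paragraph and not a formula taken from the model: the function names are hypothetical, and VMs are assumed to be homogeneous with a normalized capacity of 1.0.

```python
def per_task_rate(num_tasks, num_vms, vm_capacity=1.0):
    """Processing rate received by each task of one service.

    Assumption of this sketch: a task never uses more than one VM, so it
    gets a full VM when tasks are scarce and an equal share of the pooled
    capacity otherwise.
    """
    if num_tasks <= 0:
        return 0.0
    return vm_capacity * min(1.0, num_vms / num_tasks)


def idle_vms(num_tasks, num_vms):
    """Number of VMs left without work when tasks are fewer than VMs."""
    return max(0, num_vms - num_tasks)
```

For instance, two tasks on four VMs each run at full speed and leave two VMs idle, whereas eight tasks on four VMs each run at half speed.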
Each web service receives requests from one or more upper web services, and it can complete tasks by itself or by sending requests to lower web services. After the responses from the lower web services have been received, the request is completed and sent to the upper web service as a response to its request. Web services receive requests of different types. Each type has its own arrival rate, processing rate, routing, and deadline. Requests of the same type are collected in a chain. A chain contains a set of classes that represent the different processing phases of a specific type. Classes are distributed among different web services, and each request moves between these classes during its life.
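For example, suppose we have a web service (node $i = M+1$) with upper web services (nodes $1, 2, \ldots, M$) and lower web services (nodes $M+2, M+3, \ldots$) (see Fig. 1). According to processor sharing, if there are $n_i(t)$ requests in node $i$ at time $t$, the processing capacity of the node is divided equally among them, so the remaining service time of every request decreases by the same share per unit of time. The total number of requests that are served in node $i$ at time $t$ is calculated as $n_i(t) = \sum_{c} n_{ic}(t)$, where $n_{ic}(t)$ is the number of requests of class $c$ that are served in node $i$ at time $t$.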
Node $i$ receives requests of different classes from upper web services and sends requests to lower web services synchronously or asynchronously. In Fig. 2, node $M+1$ sends asynchronous requests to nodes $M+2$ and $M+3$. Chain 1 describes the routing behavior of type 1 requests: a request visits node $M+1$ in class a, node $M+2$ in class b, node $M+1$ in class c, node $M+3$ in class d, and node $M+1$ in class e.
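A chain such as Chain 1 can be written down as an ordered list of (node, class) visits. The Python sketch below is only an illustration: the names `CHAIN_1`, `visits_per_node`, and `node_demand` are hypothetical, node labels are kept as strings, and the arrival rate of 10 requests per time unit is an arbitrary example value.

```python
from collections import Counter

# Chain 1 from the text: the (node, class) pairs a type 1 request visits,
# in order.
CHAIN_1 = [("M+1", "a"), ("M+2", "b"), ("M+1", "c"), ("M+3", "d"), ("M+1", "e")]


def visits_per_node(chain):
    """Count how many processing phases of the chain run on each node."""
    return Counter(node for node, _ in chain)


def node_demand(chain, arrival_rate):
    """Class arrivals per time unit that the chain places on each node.

    Assumes every request follows the whole chain, so each visit
    contributes the full arrival rate of the chain.
    """
    return {node: count * arrival_rate for node, count in visits_per_node(chain).items()}


demand = node_demand(CHAIN_1, 10.0)
```

Here node M+1 handles three phases of every type 1 request, so a surge of type 1 requests loads it three times as heavily as nodes M+2 and M+3.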
In some cases, node $M+1$ needs to use two or more nodes synchronously to complete a specific request. In this case, several sub-requests are generated, processed in parallel, combined into one request, and sent back to node $M+1$. In Fig. 3, node $M+1$ sends synchronous requests to nodes $M+2$ and $M+3$. The fork node represents the decomposition of a request into two or more sub-requests, which are processed in parallel by nodes $M+2$ and $M+3$. The synchronizing node represents a buffer that holds a completed sub-request until it can be recomposed with the sub-requests from its sibling nodes. The join node represents the recombination of the completed sub-requests into one request again.
In multi-class M/M/m processor sharing queuing systems, requests of class $c$ arrive at node $i$ according to a Poisson process with rate $\lambda_{ic}$ and require a service time $s_{ic}$ that follows an exponential service process. Each class of requests has a deadline $D_c$. Arrival rates and service times are all assumed to be mutually independent. The deadline of each request class is specified according to the required Service Level Agreement.
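The fork, synchronize, and join steps described for synchronous requests can be sketched as follows. This is a simplified illustration and not part of the model: the function names and the optional overhead parameters are hypothetical, and the completion times of the sub-requests are assumed to be known.

```python
def fork_join_response_time(sub_request_times, fork_overhead=0.0, join_overhead=0.0):
    """Time until the recombined request is available again.

    `sub_request_times` maps each lower node to the completion time of its
    sub-request. The join can only happen once the slowest sibling is done.
    """
    if not sub_request_times:
        raise ValueError("a fork needs at least one sub-request")
    return fork_overhead + max(sub_request_times.values()) + join_overhead


def waiting_in_sync_buffer(sub_request_times):
    """Time each completed sub-request waits in the synchronizing buffer."""
    slowest = max(sub_request_times.values())
    return {node: slowest - done for node, done in sub_request_times.items()}


times = {"M+2": 0.5, "M+3": 2.0}  # illustrative completion times
total = fork_join_response_time(times)
waits = waiting_in_sync_buffer(times)
```

The sketch makes the consequence for scaling visible: the response time of a synchronous request is governed by the slowest lower node, so adding VMs to the faster sibling does not shorten it.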
A request that is completed at node $i$ is sent to a node $j$ among the upper nodes (nodes $1, 2, \ldots, M$) if it is completely finished. All nodes receive responses from other services for their requests. A request is sent to a node $j$ among the lower nodes (nodes $M+2, M+3, \ldots$) if it still requires more processing. A request is sent from node $i$ to node $i$ itself if there is a new program path. If the deadline of any request expires, the request exits the system, which constrains the sum of the routing probabilities $p_{ic,jd}$, where $p_{ic,jd}$ is the probability of sending a request from node $i$ in class $c$ to node $j$ in class $d$.
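A provider who specifies routing probabilities by hand may want to validate them. The following Python sketch assumes, as one plausible reading of the paragraph above, that the probabilities leaving a (node, class) pair sum to at most one and that the remainder is the probability of leaving the system. The function name and the sample values are hypothetical.

```python
def exit_probabilities(routing, tolerance=1e-9):
    """Check a routing table and return the exit probability of each class.

    `routing` maps a source (node, class) pair to a dict of destination
    (node, class) pairs and their probabilities. The rule that the
    outgoing probabilities sum to at most one is an assumption.
    """
    exits = {}
    for source, targets in routing.items():
        total = sum(targets.values())
        if total > 1.0 + tolerance or any(p < 0.0 for p in targets.values()):
            raise ValueError(f"invalid routing probabilities for {source}")
        exits[source] = max(0.0, 1.0 - total)
    return exits


# Illustrative values only; they are not taken from the evaluation.
ROUTING = {
    ("M+1", "a"): {("M+2", "b"): 1.0},
    ("M+1", "c"): {("M+3", "d"): 0.5, ("M+1", "e"): 0.25},
}
EXITS = exit_probabilities(ROUTING)
```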
In the root service, the arrival rate of each request class is observable and can be measured easily. The probability $p_{ic,jd}$ can be specified by SaaS application providers based on the business process workflow of their applications.
According to Burke's Theorem [26], the departure process from a queue is Poisson; moreover, randomly splitting a Poisson process gives Poisson processes, and the sum of Poisson processes is a Poisson process. Therefore, the arrival process of each request class $c$ at each node $i$, with rate $\lambda_{ic}$, is Poisson.
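Because split and merged Poisson streams remain Poisson, the arrival rate of every class can be obtained by pushing the measured root arrival rates through the routing probabilities. The Python sketch below illustrates this with a fixed number of sweeps, which is sufficient for a feed-forward routing table. The function names, the node labels, and the numbers are hypothetical.

```python
def class_arrival_rates(external, routing, sweeps=50):
    """Propagate external arrival rates through routing probabilities.

    `external` maps (node, class) to the measured external arrival rate,
    and `routing` maps a source (node, class) to {destination: probability}.
    Splitting multiplies a rate by the probability, merging adds rates.
    """
    rates = dict(external)
    for _ in range(sweeps):
        updated = dict(external)
        for source, targets in routing.items():
            for destination, probability in targets.items():
                updated[destination] = (
                    updated.get(destination, 0.0) + rates.get(source, 0.0) * probability
                )
        rates = updated
    return rates


def node_arrival_rates(rates):
    """Sum the per-class rates of each node."""
    totals = {}
    for (node, _), rate in rates.items():
        totals[node] = totals.get(node, 0.0) + rate
    return totals


external = {("X", "a"): 8.0}
routing = {
    ("X", "a"): {("Y", "b"): 0.75, ("Z", "c"): 0.25},
    ("Y", "b"): {("X", "d"): 1.0},
    ("Z", "c"): {("X", "d"): 1.0},
}
rates = class_arrival_rates(external, routing)
per_node = node_arrival_rates(rates)
```

In this example the stream of 8 requests per time unit splits into 6 and 2, and both parts merge again at the last class, which therefore sees the full rate of 8.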
In steady state, the total required service time from node $i$ at time $t$ is calculated as a sum over the request classes.
Service time: while the arrival time and departure time of each request class are observable and can be measured easily, the service time of each request class is not observable and cannot be measured easily (because of processor sharing). Therefore, the service time $s_{ic}$ of requests of class $c$ that arrive at node $i$ is calculated, assuming homogeneous servers, by summing the processing share that a request receives over the interval between $a_{ic}$ and $d_{ic}$, where $a_{ic}$ is the observed arrival time of a request of class $c$ at node $i$, $d_{ic}$ is the observed departure time of a request of class $c$ from node $i$, $m_i(t)$ is the number of running servers in node $i$ at time $t$, and $n_i(t)$ is the total number of requests that are served in node $i$ at time $t$.
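The idea of recovering the unobservable service time from observable quantities can be sketched in Python. The per-time-unit share used here, min(1, m/n), is an assumption derived from the processor sharing description in this section and is not necessarily the exact expression of the model. The function name and the sample data are hypothetical, and time is assumed to advance in whole units.

```python
def estimate_service_time(arrival, departure, servers, requests):
    """Estimate the service a request received between arrival and departure.

    `servers[t]` is the number of running servers and `requests[t]` the
    number of requests in the node during time unit t. The share
    min(1, m/n) per time unit is an assumption of this sketch.
    """
    received = 0.0
    for t in range(arrival, departure):
        n = requests[t]
        m = servers[t]
        if n > 0:
            received += min(1.0, m / n)
    return received


# Two servers throughout, while the node fills up (illustrative data).
servers = [2, 2, 2, 2]
requests = [1, 2, 4, 8]
estimate = estimate_service_time(0, 4, servers, requests)
```

The request stays in the node for four time units but only receives 2.75 units of service, because its share shrinks as the node fills up.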

Number of required servers: processor sharing does not consider the deadlines of request classes and gives the same amount of processing to all requests. Therefore, the number of servers required at node $i$ to serve incoming requests without violating the Service Level Agreement is calculated using $m_i$ and $D_i^{\min}$, where $m_i$ is the number of servers in node $i$ and $D_i^{\min}$ is the minimum deadline of all request classes.
Service rate: with $m_i$ servers, node $i$ delivers service to requests of class $c$ at a rate that depends on $n_i(t)$ and $n_{ic}$, where $n_i(t)$ is the total number of requests that are served in node $i$ at time $t$ and $n_{ic}$ is the number of requests of class $c$ in node $i$.
Utilization: the utilization of node $i$ at time $t$, which is the fraction of time the servers in the node are busy, is approximated by a sum over the request classes.
Throughput: the throughput of node $i$ from class $c$ at time $t$ is calculated as in [27], and the total throughput of node $i$ is calculated as the sum of its per-class throughputs.
Service size: if the system is in steady state, the probability that given numbers of requests of each class exist in the system can be calculated as in [27], where $n_c(t)$ is the number of requests of class $c$ that exist in the system at time $t$.
Let $P(\tau, x, n)$ denote the probability of responding after exactly $\tau$ to a request with remaining service time $x$ from a node that contains $n$ requests. This probability can be calculated by applying the Random Quantum Allocation approximation model proposed by Braband in [28]. A request leaves the system immediately if its service time is finished. If the remaining service time is greater than zero, the probability of responding after $\tau$ is calculated from the following quantities: the average arrival rate; the time slice length, which is equal to the time unit in this model; the number of requests that can be accepted by a node that already contains $n$ requests; the probability that requests leave a node that contains $n$ requests; and the average service time.
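As a rough illustration of how per-class arrival rates and service times translate into utilization and a server count, the following Python sketch uses the standard offered-load approximation from queuing theory. It is not the deadline-aware expression of the proposed model: the function names, the target utilization, and the sample classes are hypothetical.

```python
import math


def offered_load(classes):
    """Sum of arrival rate times mean service time over all classes.

    `classes` maps a class name to (arrival_rate, mean_service_time).
    """
    return sum(rate * service_time for rate, service_time in classes.values())


def utilization(classes, servers):
    """Textbook approximation: offered load divided by the number of servers."""
    if servers <= 0:
        raise ValueError("a node needs at least one server")
    return offered_load(classes) / servers


def servers_for_target(classes, target_utilization):
    """Smallest server count that keeps utilization at or below the target."""
    if not 0.0 < target_utilization <= 1.0:
        raise ValueError("target utilization must be in (0, 1]")
    return max(1, math.ceil(offered_load(classes) / target_utilization))


# Illustrative classes: (arrival rate, mean service time).
classes = {"a": (4.0, 0.5), "b": (2.0, 1.0)}
current = utilization(classes, 8)
needed = servers_for_target(classes, 0.5)
```

A threshold-based auto-scaler could compare `current` with its upper and lower thresholds, while a deadline-aware estimate such as the one in this section would replace `servers_for_target`.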

IV. EVALUATION
To evaluate the performance of the proposed model, the threshold-based auto-scaling algorithm (without workload prediction) proposed by Shahin in [29] has been implemented with and without the proposed model. Several web applications have been modeled using the CloudSim simulator with NetworkCloudSim. NetworkCloudSim is an extension of CloudSim that supports the modeling of generalized applications such as High Performance Computing (HPC), e-commerce, social network, and web applications. For each application model, different chains have been defined, and requests to each application are generated according to the ClarkNet trace [30]. Fig. 4 shows the model of a sample application with 6 services. Each service has been deployed in a separate VM. During run time, the number of running VMs in each service ranges between 1 and 83. As shown in Table 1, 6 chains have been defined with 20 classes. Table 2 shows the classes of each service. According to Table 1 and Table 2, the corresponding routing probabilities are set to one, and the remaining probabilities are set to zero.
As shown in Fig. 5 and Table 3, the proposed model increases the number of completed requests, which reduces violations of the Service Level Agreement and increases revenue. During run time, the total number of running VMs ranges between 6 and 415. By considering functional dependencies, VMs are added in advance to serve incoming requests.
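The benefit of adding VMs in advance can be illustrated by comparing the time needed to scale a dependency path one node after another with the time needed to scale all nodes of the path at once. The Python sketch below is a simplification: the function names are hypothetical, and a fixed provisioning delay per scaling action is assumed (10 minutes and a path over nodes 1, 2, and 4 are used as illustrative values).

```python
def sequential_scaling_delay(path, provisioning_minutes):
    """Delay when each node is scaled only after its upper node is ready.

    Without dependency information, the overload is detected layer by
    layer, so the provisioning delays add up along the path.
    """
    return len(path) * provisioning_minutes


def concurrent_scaling_delay(path, provisioning_minutes):
    """Delay when all nodes on the path are scaled at the same time."""
    return provisioning_minutes if path else 0


path = [1, 2, 4]  # nodes used by the overloaded chain (illustrative)
without_model = sequential_scaling_delay(path, 10)
with_model = concurrent_scaling_delay(path, 10)
```

Under these assumptions the sequential delay grows linearly with the number of layers, while the concurrent delay stays constant.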
The implemented algorithm is a reactive algorithm, and it requires around 10 minutes to add new VM instances [30]. For example, if the first node is detected as over-utilized because of a large number of requests from chain 1, without the proposed model it will take around 30 minutes before the application is ready to respond. This is because VMs are added sequentially to nodes 1, 2, and 4. With the proposed model, it will take only around 10 minutes, because VMs are added to nodes 1, 2, and 4 concurrently. Therefore, the proposed model is not affected by the number of layers in an application. On the other hand, scaling without considering functional dependencies increases Service Level Agreement violations because of the long sequence of scaling actions. Fig. 6, Fig. 7, and Fig. 8 show the number of requests completed by applications that contain different numbers of layers. As shown in these figures, the delays in scaling up applications that do not apply the proposed model are proportional to the number of application layers.

V. CONCLUSION
Nowadays, several applications have been moved to cloud computing to benefit from its features. Cloud computing provides a large pool of resources that can be provisioned and released on demand. Some applications are small and can be encapsulated in a single VM, while large applications (such as social networks) are distributed over several VMs. Although the functional dependency between services that are deployed in separate VMs has to be considered during application scaling, most current scaling techniques do not consider functional dependency and scale services individually. This paper has modeled SaaS applications as a multi-class M/M/m processor sharing queuing model with deadlines, in order to consider functional dependencies and request types when estimating the resources required for scaling.
Based on the experimental results, this paper concludes that modeling functional dependencies as a multi-class M/M/m processor sharing queuing model improves the performance of scaling algorithms.
In the future, the proposed model will be extended to include multiple classes with different weights, which represent the different priorities that can be provided to customers.