Towards a Framework for Multilayer Computing of Survivability

The notion of survivability holds an important position in today's enterprise systems and critical functions, and it has been defined in different ways. However, the major gap in research in this field is the lack of a comprehensive, multilayer model for computing survivability quantitatively, one that is fully general and applicable across applications. This research designs a comprehensive, multilayer, and general model for modeling and computing survivability. Because the Markov property holds in the proposed model, we use a Markov model. Using the proposed three-layer architecture and a Markov structure, we first compute the survivability of each infrastructure component separately, regardless of its functional dependencies on other components. The computations are then generalized to take into account the dependencies among components as well as the dependencies of the upper layers, so that the survivability of each vital function in the highest architectural layer is computed from the underlying layers. Finally, a common crisis management structure is studied and its results are analyzed; the model successfully computed the survivability of the whole crisis management system.

Keywords: Network survivability; survivability quantification; survivability computation; system survivability


I. INTRODUCTION
Today, all social, cultural, political, and economic aspects of the life of societies and states depend on information technology infrastructures, and this dependency is ever increasing. Because of this dependency, important concerns have emerged about the functional quality and serviceability of those infrastructures. It is becoming more important day by day whether these infrastructures can tolerate different challenges (from natural ones such as floods and earthquakes to human errors and adversarial intrusions) and still provide their major and essential services. We therefore need to compute the resistance of infrastructures against such challenges in order to plan, implement, and use them better, and to find appropriate solutions for improving that resistance. This property is captured by the survivability metric.

A. Qualitative Definitions of Survivability
As with many other scientific subjects, there is no unanimous definition of survivability. Table I summarizes the definitions provided so far, ordered chronologically and listed with their references. Each definition depends on the field in which it is needed and on its own origins. The definitions differ in several respects, so a deep understanding of the intended problem is needed to choose the most suitable one. In our context, the dominant definition, used by many other studies, is the fourth one, so we use it too.

B. Quantitative Definition of Survivability
All definitions in Table I take a qualitative approach. ANSI has provided a quantitative definition of survivability [27] that models the concept parametrically; Fig. 1 shows this definition. In it, the measure of interest $M$ has the value $m_0$ just before a failure occurs. The survivability of the system is represented by the following attributes:
- $m_a$ is the value of $M$ immediately after the failure.
- $m_u$ is the maximum possible difference between $m_0$ and $m_a$ after the failure.
- $m_r$ is the restored value of $M$ after time $t_r$.
- $t_r$ is the time required for $M$ to regain the value $m_0$, or a reduced but acceptable value if $m_0$ cannot be fully restored.
The notion of survivability may seem similar to, or overlapping with, certain notions from the dependability field such as reliability, availability, fault tolerance, maintainability, security, and safety. These similarities and differences have been discussed in several important references, such as [12, 24, 25, 28-30], to which we refer the reader.

II. RELATED WORKS
Survivability has been studied both implicitly and explicitly. By explicit studies we mean those that focus clearly on survivability, whereas implicit studies deal with related concepts such as system recovery or intrusion tolerance. Moreover, some studies provide only a qualitative model and do not indicate how to infer the level of survivability from it, while others attempt to make the issue quantitative and compute the survivability.

Table I entries (number, field, year, definition, reference), where available:
- Survivability is a property of a system, subsystem, equipment, process, or procedure that provides a defined degree of assurance that the named entity will continue to function during and after a natural or man-made disturbance. [15]
- 3, Network Computing Systems, 1997: Survivability is the ability of a network computing system to provide essential services in the presence of attacks and failures and recover full services in a timely manner. [3]
- 4, Critical and Defense Systems, 1999: Survivability is the capability of a system to fulfill its mission, in a timely manner, in the presence of attacks, failures or accidents.
- Survivability is the ability [of a system] to continue to provide service, possibly degraded or different, in a given operating environment when various events cause major damage to the system or its operating environment.

The SABER model [13] provides an architecture for intrusion tolerance in systems. It takes a conventional network-security approach that lets the system continue the required services under an intrusion or attack, using IDS sensors and a higher security level called SOS. This research covers only software attacks and does not address other malicious or non-malicious undesired events.
The ITDOS architecture [14] provides an intrusion-tolerant structure for software systems using facilities based on CORBA middleware. It thereby ensures that all CORBA-based software built with the proposed extensions is intrusion tolerant. However, no review or implementation of this architecture has been reported.
The AWDRAT method [18] provides a self-adaptive approach at the software middleware level that detects a possibly compromised point by comparing the behavior of the running program with the desired behavior. A trust management system then manages the restoration process, switching the execution path from the compromised components to unaffected ones so that the system can continue secure and trusted operation under attack. Although this research reported successful experimental results, it focuses only on malicious software threats, overlooks other kinds of threat, and is therefore not yet fully comprehensive.
The DPASA architecture [17] provides a model for recovering the system state after a cyber attack. It uses a set of tools and methods for detection, protection, and adaptive reaction. To assess the model's performance, it was applied to a JBI example from a US Air Force laboratory, where the accuracy of recovery and the attack tolerance were modeled and represented with several parameters. However, the parameters introduced in this research are not public and cannot be generalized to other problems, so any new problem needs its own parameters to be extracted. This research considers only cyber attacks and is general and qualitative, meaning that it has no measurable parameters.
The Willow architecture [8] claims to create survivability in large, critical, distributed systems. It combines fault avoidance, fault elimination, and fault tolerance. It disables vulnerable components under threatening conditions and replaces damaged components after a fault or intrusion. When an unavoidable fault appears, it can strengthen the system against it through reconfiguration based on feedback from a control system. An experimental study on a JBI basis for the US Air Force was reported for this architecture, but its results have not yet been published, and the authors only claim that the system functioned successfully.
The works reviewed so far are qualitative models that provide no accurate computations. Trivedi et al. [16], [26] provide a model for computing survivability quantitatively, designed as a composition of availability and performability. They build two Markov models, one for availability and one for performability, and then combine them into a composite Markov model of system survivability. To verify the model, they applied it to a given telecommunication system and computed the survivability, assuming that the input variables of the model are known. This model has the advantage of being quantitative and computable. However, the research ultimately gives users of software services no clear way to verify the survivability of their own service. Moreover, the problem considered there is a well-known, already solved problem with clear solutions, whereas such clarity is generally difficult, if not impossible, to obtain for broader and more complex problems. For this reason, the proposed method is difficult to generalize to other problems in this field.
In "Survivability Analysis of a Computer System under an Advanced Persistent Threat Attack" [31], the authors model and compute the survivability of a software system under APT attacks. For this purpose, they propose an integrated model of the APT attack process together with the different steps for defending against it. Their final goal is a continuous-time Markov model that computes the total survivability of the system under an APT attack. To create the model, the steps of the APT attack are modeled with Stochastic Reward Nets (SRN) and the corresponding graph is produced. The reachability graph of this Petri net is then drawn as a continuous-time Markov model for computing survivability. The model introduces system recovery, system reachability, data confidentiality, and data accuracy as the four parameters of survivability, and the SRN and the Markov model are built to compute these four parameters. Finally, to verify the model, the authors take some of the required probability values from valid references, assume the others, and apply them to the model to compute the four probability measures of survivability. Although the model offers an appropriate computation of survivability, it is specific to APT cyber attacks and is not suitable for other applications.
To compute survivability in the software layer in general, [21] provides measurable criteria for defining and assessing software survivability from the end user's viewpoint. It offers a framework for defining software survivability quantities that enables the user to design and execute policies for achieving survivability based on those quantities. A decision support model is also proposed to realize the survivability quantities and ensure a minimum survivability for the software. These quantities are classified into five groups. Each group represents one characteristic of survivability and contains several survivability-related quantities. Finally, survivability computations are classified into two groups: contribution-oriented and concern-oriented. The contribution-oriented functions compute those characteristics of survivability that the user needs essentially and that must be met fully. In contrast, the concern-oriented functions compute those characteristics and quantities that the user is concerned about but whose violation the user can tolerate up to a certain level.
The same author, in [22] and [23], uses proof-carrying code for survivability assessment. The general idea is that the user defines the survivability requirements of the software and provides them to the software vendor. Using the proof-carrying code method, the vendor then delivers a system that lets the user assess its survivability against the initially stated requirements. The main reference introducing proof-carrying code, on which this method relies, is [32].
All the works by Dr. Zuo reported here compute and parameterize survivability based only on the characteristics of the software system itself. However, the secure and correct execution of any software system depends on the security and correct operation of the infrastructure components that the system relies on. Unfortunately, these works do not discuss those components and leave this question unresolved.

III. BASIC ARCHITECTURE
As mentioned in Section II, the general and widespread weakness of the works in this field is that the system user cannot compute the overall system survivability from information about the different layers. Some works compute the survivability of the infrastructure layer without enabling the user to use it for computing service or software survivability. Others compute it in the software layer without taking into account the logical and operational dependency between the software layer and the underlying layers. Naturally, such computations are not comprehensive and lack accuracy and integrity. To remove this limitation, a model for survivability computation is required that connects the layers. Fig. 2 shows this model.
The model is a set of heterogeneous agents and components that are placed next to each other randomly and unpredictably. Each component can be connected to others, and there is no predefined limit on the services it provides to other components; this does not, of course, refer to practical limits such as memory or link capacity. Each component participates in one or more applications belonging to the software layer. The whole system is depicted as a set of functions or services in the top layer, the operation level. In this layer we deal with organizational processes as the functional components. Functions are executed using several applications; in other words, each function needs some applications in order to operate. In the model of Fig. 2, the system is assumed to have X functions that use n applications to fulfill their functions and services, and the applications run on k components.

C. Relations in the Model
The relationships between components must be understood and analyzed accurately to make the model efficient and practically usable. Within the infrastructure layer, the relationships among components are transversal, whereas the relationships between infrastructure-layer components and software-layer applications, and between software-layer applications and operation-level functions, are longitudinal. In a real environment, components can serve each other. The relationship between components must therefore be directional, to show which component is the client and which is the service provider.
Given the directionality of the graph, it must be clarified whether the graph contains a loop or is a DAG. Although it is acceptable to assume that the graph may contain a loop, it may also be a DAG. What matters here is the resolution and granularity of our view. For instance, if a smart building management system (BMS) is taken as a single component, it provides the inputs needed by other systems such as ventilation, cooling, and electricity; in this situation the component is modeled at a coarse granularity. If the BMS is instead separated into its basic facilities and modules and each module is considered a component, those components are single-task, which leads to a loop-free graph. In either case, we assume that the component graph is a DAG and base our algorithm on it. Although this decomposition helps to achieve a DAG, appropriate algorithms for cyclic graphs can be developed in future work.
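The DAG assumption above can be checked mechanically before any survivability computation. The following is a minimal sketch, assuming the dependency graph is given as a mapping from each component to the set of its service providers; the function name `computation_order` and the building-management module names are hypothetical and only illustrate the idea. It uses the standard-library `graphlib` module (Python 3.9 or later) to obtain an order in which every provider is processed before its clients, or to report a cycle.

```python
from graphlib import CycleError, TopologicalSorter


def computation_order(providers):
    """Return components ordered so that every provider precedes its clients.

    `providers` maps each component name to the set of components it
    depends on (its service providers). Raises ValueError on a cycle.
    """
    sorter = TopologicalSorter(providers)
    try:
        # static_order() is lazy, so the cycle check runs inside list().
        return list(sorter.static_order())
    except CycleError as err:
        # err.args[1] holds the nodes that form the detected cycle.
        raise ValueError(f"dependency graph has a cycle: {err.args[1]}") from err


# Hypothetical building-management example, split into single-task modules.
providers = {
    "ventilation": {"power"},
    "cooling": {"power", "ventilation"},
    "power": set(),
}
order = computation_order(providers)
print(order)
```

If the coarse-grained view of the system produces a cycle, the `ValueError` signals that the components should be decomposed further, as the text suggests, before a DAG-based algorithm is applied.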
In the upper (software) layer, we can assume that there is no direct dependency between applications, because when system A serves system B it means that some components of system A serve some components of system B; this is already captured in the relationships between components. Therefore, there is no explicit transversal dependency between applications, and an independent set of applications forms a function in the operation layer. Fig. 3 depicts this notion. We must now analyze and quantify, separately, the dependency of components on each other, of applications on components, and of functions on applications. To do so, the model edges are named according to Fig. 2. For ease of presentation, applications are denoted by $AP_i$, functions by $FS_i$, and components by $CMP_i$. In this model, $\alpha_{i,j}$ represents the total dependency of $FS_j$ on $AP_i$, $\beta_{i,j}$ represents the dependency of $AP_j$ on $CMP_i$, and $\gamma_{x,y}$ represents the dependency of component y on component x. The coefficients $\alpha$, $\beta$, and $\gamma$ are real numbers between 0 and 1. We now discuss the properties of these coefficients in the graph of Fig. 3.
For each function $FS_j$, the coefficients $\alpha_{i,j}$ sum to 1 over its applications, because each function consists of its applications and, user mistakes aside, the full execution of those applications means that $FS_j$ is executed completely.
Likewise, for each application $AP_j$, the coefficients $\beta_{i,j}$ sum to 1 over its components. That is, the full operation of any application requires that all of its components fulfill their tasks completely, because each system consists only of its own correctly functioning components, and no other component intervenes in the correct execution of the application.
For each component y, the sum of the coefficients $\gamma_{x,y}$ over its providers x is less than 1. That is, each component is only partially, and not fully, functionally dependent on other components. Each component has its own independent functionality, which is why this sum must be less than 1. If a service provider component does not supply the inputs required by a client component, the function of the client component is degraded in proportion to its coefficient of dependency on that provider.
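The coefficient properties discussed above lend themselves to a simple consistency check on model inputs. The sketch below is illustrative only: it assumes that the dependency coefficients of each function and each application sum to 1, and that the coefficients of each component on its providers sum to less than 1, as the surrounding text indicates. The nested-dictionary layout and the name `check_coefficients` are assumptions made for this example.

```python
import math


def check_coefficients(alpha, beta, gamma, tol=1e-9):
    """Return a list of human-readable constraint violations.

    Each argument maps a dependent item to {provider: coefficient}:
      alpha[function][application], beta[application][component],
      gamma[client_component][provider_component].
    Assumed rules: alpha and beta rows sum to 1, gamma rows sum to < 1,
    and every coefficient lies in [0, 1].
    """
    problems = []
    tables = (("alpha", alpha, True), ("beta", beta, True), ("gamma", gamma, False))
    for name, table, must_sum_to_one in tables:
        for dependent, weights in table.items():
            if any(not 0.0 <= w <= 1.0 for w in weights.values()):
                problems.append(f"{name}[{dependent}]: coefficient outside [0, 1]")
            total = sum(weights.values())
            if must_sum_to_one and not math.isclose(total, 1.0, abs_tol=tol):
                problems.append(f"{name}[{dependent}]: sum is {total}, expected 1")
            if not must_sum_to_one and total >= 1.0:
                problems.append(f"{name}[{dependent}]: sum is {total}, expected < 1")
    return problems


# Hypothetical coefficients for one function, two applications, two components.
alpha = {"FS1": {"AP1": 0.6, "AP2": 0.4}}
beta = {"AP1": {"CMP1": 0.5, "CMP2": 0.5}, "AP2": {"CMP2": 1.0}}
gamma = {"CMP2": {"CMP1": 0.3}}
print(check_coefficients(alpha, beta, gamma))
```

Returning a list of violations rather than raising on the first one makes it easier to review a whole model description in one pass.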

IV. SURVIVABILITY BASIC MODEL
In this section we propose our basic conceptual model for the survivability of a system in general. As we saw in Section I, survivability aims at enabling the system to continue its vital and essential services and operations under crisis until the failed subsystems are recovered. Modeling the survivability of any system therefore requires three basic states. In the first, called the Healthy state, the system operates normally and the crisis has no degrading effect on it. In the second, called the Fail state, important and critical subsystems are lost, which results in total failure and breakdown. In the third, called the Survive state, some non-critical subsystems have failed but the system can continue its fundamental operation until the problem is removed. The tri-state Markov model in Fig. 4 represents these definitions. In this model, the parameters $\mu$ and $\rho$ are the transition rates between the states of the Markov model.
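As an illustration of how a tri-state model of this kind can be evaluated, the sketch below computes the steady-state probabilities of a three-state continuous-time Markov chain from its balance equations. The transition structure is an assumption made for this example, since Fig. 4 is not reproduced here: Healthy to Survive at rate `rho1`, Survive to Fail at rate `rho2`, Healthy to Fail at rate `rho3`, Survive to Healthy at rate `mu1`, and Fail to Healthy at rate `mu2`. The function name and the numeric rates are hypothetical.

```python
def tri_state_steady_state(rho1, rho2, rho3, mu1, mu2):
    """Steady-state (healthy, survive, fail) probabilities.

    Assumed transitions (not taken from the paper's figure):
      Healthy -> Survive: rho1    Survive -> Healthy: mu1
      Survive -> Fail:    rho2    Fail    -> Healthy: mu2
      Healthy -> Fail:    rho3
    """
    # Balance at Survive: pi_H * rho1 = pi_S * (rho2 + mu1)
    survive_ratio = rho1 / (rho2 + mu1)
    # Balance at Fail: pi_H * rho3 + pi_S * rho2 = pi_F * mu2
    fail_ratio = (rho3 + survive_ratio * rho2) / mu2
    # Normalize so that the three probabilities sum to 1.
    healthy = 1.0 / (1.0 + survive_ratio + fail_ratio)
    return healthy, healthy * survive_ratio, healthy * fail_ratio


# Hypothetical rates (events per hour).
pi_h, pi_s, pi_f = tri_state_steady_state(0.02, 0.01, 0.01, 0.5, 0.25)
print(pi_h, pi_s, pi_f)
```

A different transition structure would change the balance equations but not the approach: express each state's probability relative to the Healthy state, then normalize.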

V. COMPUTING SURVIVABILITY
To compute survivability across the layers of Fig. 2, we start from the lowest layer and compute the survivability of each infrastructure component separately, regardless of its dependency on other components. The survivability of the infrastructure-layer components is then recomputed taking their dependencies into account. In the next step, the survivability of the applications in the software layer is computed from their dependency on the infrastructure-layer components and the results obtained for that layer. Finally, the survivability of the functions in the operation layer is computed from the results of the software layer.

A. Computing the Survivability of a Single Infrastructure Component
The model depicted in Fig. 5 is the Markov model for the survivability of a single infrastructure-layer component. $attr_i$ is an attribute or subsystem of the component, and $\alpha_i$ is the probability that $attr_i$ is healthy, so $1 - \alpha_i$ is the probability that it has failed. Some attributes or subsystems are critical for the basic functioning of the component while others are not. The component cannot tolerate the failure of a critical attribute and enters the fail state when one fails. When non-critical attributes or subsystems fail, the component can continue its essential functions and enters the survive state. Critical attributes are marked with * in Fig. 5. In the model of Fig. 5, each property $attr_i$ has a bi-state Markov model, shown in Fig. 6, in which $\lambda_i$ and $\mu_i$ are the failure and recovery rates of property i. In the transient state, the probabilities of the healthy and fail states of this model are given by Eq. (4). In Eq. (4), $\pi_i^{op}$ and $\pi_i^{f}$ denote the probabilities of the healthy and failure states of property i; for ease of reading and writing, they appear as $\alpha_i$ and $1 - \alpha_i$ in Fig. 5.
The number c in Eq. (4) is an arbitrary constant. The steady-state probabilities of the healthy and fail states are obtained from the same model. In the model of Fig. 5, the values assigned to $\alpha_i$ are probabilities, whereas the values of $\rho$ and $\mu$ are rates. The values of $\alpha_i$ are given and already known. Therefore, the probabilities of the three states of the Markov model must be obtained first, and the rates $\rho$ and $\mu$ are then computed from those probabilities. For this purpose, three sets are introduced for use in Eq. (7). S is the set of all properties of the component, IC is the subset of S containing the critical properties, and INC contains the non-critical ones. The formulae for the survivability computation follow.
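For the bi-state model of a single property, the transient and steady-state probabilities have a standard closed form for a two-state continuous-time Markov chain. The sketch below uses that textbook result rather than the paper's own Eq. (4), which is not reproduced here; the function name, the parameter names, and the assumption that the property starts in the healthy state are illustrative choices.

```python
import math


def two_state_probabilities(failure_rate, recovery_rate, t, healthy_at_start=1.0):
    """Return (healthy, failed) probabilities at time t.

    Standard two-state CTMC solution:
      healthy(t) = mu / (lam + mu) + c * exp(-(lam + mu) * t)
    where c is fixed by the initial condition healthy(0).
    """
    total = failure_rate + recovery_rate
    steady_healthy = recovery_rate / total
    # The constant follows from the probability of being healthy at t = 0.
    c = healthy_at_start - steady_healthy
    healthy = steady_healthy + c * math.exp(-total * t)
    return healthy, 1.0 - healthy


# Hypothetical rates: one failure per 10 hours, repair in about 1.1 hours.
lam, mu = 0.1, 0.9
for hours in (0.0, 1.0, 5.0, 50.0):
    print(hours, two_state_probabilities(lam, mu, hours))
```

As t grows, the exponential term vanishes and the healthy probability approaches `recovery_rate / (failure_rate + recovery_rate)`, which is the steady-state value that can serve as $\alpha_i$.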
Given the practical conditions of this model, all properties can be considered independent. Even if some properties depend on each other in practice, the desired independence can be obtained by changing the system design. Under this assumption, the probability of the fully healthy state equals the product of the healthy-state probabilities of all properties, as shown in Eq. (7)(a). To compute the probability of the survive state, the failure probabilities of the non-critical properties are taken over all their possible combinations and multiplied by the probability that all critical properties are healthy, as shown in Eq. (7)(b). The probability of the fail state is the probability that some critical property has failed, regardless of whether the non-critical properties are healthy, as shown in Eq. (7)(c).
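The three state probabilities described above can be sketched in a few lines under the stated independence assumption. In this illustrative version, the sum over all failure combinations of the non-critical properties is collapsed into the equivalent closed form "one minus the probability that every non-critical property is healthy". The function name, the dictionary layout, and the attribute names are hypothetical.

```python
from math import prod


def component_state_probabilities(alpha, critical):
    """Return (healthy, survive, fail) for one component.

    `alpha` maps each attribute to its healthy-state probability and
    `critical` is the set of critical attribute names. Attributes are
    assumed to fail independently of each other.
    """
    critical_ok = prod(a for name, a in alpha.items() if name in critical)
    noncritical_ok = prod(a for name, a in alpha.items() if name not in critical)
    healthy = critical_ok * noncritical_ok          # every attribute healthy
    survive = critical_ok * (1.0 - noncritical_ok)  # only non-critical failures
    fail = 1.0 - critical_ok                        # at least one critical failure
    return healthy, survive, fail


# Hypothetical component with two critical attributes and one non-critical.
alpha = {"power": 0.9, "cpu": 0.8, "display": 0.5}
healthy, survive, fail = component_state_probabilities(alpha, {"power", "cpu"})
print(healthy, survive, fail)
```

By construction the three values add up to 1, which matches the requirement that the state probabilities of the Markov model form a distribution.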
The probabilities of the states of the Markov model of Fig. 5 have been computed in Eq. (7). We must now prove that these three probabilities sum to 1, as the Markov model requires; in other words, that $\pi_H + \pi_S + \pi_F = 1$.
Theorem: In the Markov model of Fig. 5, $\pi_H + \pi_S + \pi_F = 1$.
Proof: After introducing shorthand notation to simplify the expressions, substituting parts (a) to (c) of Eq. (7) yields Eq. (8). The P(S) in the final result of Eq. (8) is the power set of S. Indeed, the final result of Eq. (8) contains, as a polynomial, every possible combination of the failure and healthy probabilities of the properties. It must now be proved that the last expression of Eq. (8) equals 1, which we do by mathematical induction. For the base case, S has two members; since the healthy and failure probabilities of each member sum to 1, the expression equals 1, and the theorem holds for S with two members. Now suppose that the expression equals 1 for S with n members; it must be proved that the relation also holds for S with n+1 members, where $S_n$ denotes the set S with n members. The induction step shows that this is the case. Therefore, the sum of the probabilities of the three states is always equal to 1.
Combining the formulae of Eq. (4) and Eq. (7) gives the transient-state probabilities of the model of Fig. 5, and the steady-state probabilities of Fig. 5 are given by Eq. (12). In this model, the number of failures of the system over time t is obtained from Eq. (13). We assume $\rho_2 = \rho_3$ in Eq. (13) because they implicitly describe an equivalent rate: $\rho_2$ is the transition rate from the survive state to the fail state, and $\rho_3$ is the transition rate from the healthy state to the fail state. Both describe the failure rate of the critical subsystems of Fig. 5, so assuming them to be equal is reasonable.
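The theorem can also be checked numerically by brute force. The sketch below is not the induction proof itself; it enumerates every healthy/failed pattern over the properties (one pattern per element of the power set of failed properties), multiplies the corresponding probabilities under the independence assumption, and assigns each pattern to the healthy, survive, or fail state. The function name and the example attributes are hypothetical.

```python
from itertools import product


def enumerate_state_probabilities(alpha, critical):
    """Sum pattern probabilities into the three component states.

    `alpha` maps attributes to healthy-state probabilities; `critical`
    names the critical attributes. Attributes are assumed independent.
    """
    critical = set(critical)
    names = list(alpha)
    totals = {"healthy": 0.0, "survive": 0.0, "fail": 0.0}
    for pattern in product((True, False), repeat=len(names)):
        probability = 1.0
        for name, is_up in zip(names, pattern):
            probability *= alpha[name] if is_up else 1.0 - alpha[name]
        failed = {name for name, is_up in zip(names, pattern) if not is_up}
        if failed & critical:
            totals["fail"] += probability      # some critical attribute is down
        elif failed:
            totals["survive"] += probability   # only non-critical ones are down
        else:
            totals["healthy"] += probability   # nothing is down
    return totals


alpha = {"power": 0.9, "cpu": 0.8, "display": 0.5, "logging": 0.7}
totals = enumerate_state_probabilities(alpha, {"power", "cpu"})
print(totals, sum(totals.values()))
```

The enumeration grows as 2^n in the number of properties, so it is useful only as a sanity check on small examples; the product form described in the text is what one would use in practice.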

B. Survivability Propagation Model of Dependent Components in Infrastructure Layer
At this step, we suppose that a technical component $CMP_i$ is functionally dependent on components $C_1$ to $C_n$. Thus, while $CMP_i$ has its own independent survivability, its final survivability also depends on the survivability of $C_1$ to $C_n$ through the two coefficients shown in Fig. 7. We must therefore compute the survivability of $CMP_i$ from the survivability of $C_1$ to $C_n$ together with its own independent survivability. We call this process propagation. It is represented in Fig. 7.
The two coefficients in Fig. 7 are obtained from Eq. (14). The $\gamma_{x,i}$ used in this formula is the dependency coefficient of $CMP_i$ on $C_x$ and is taken from Fig. 3. In Eq. (14), $\pi_H^{C_x}$ is the healthy-state probability of the infrastructure component $C_x$ on which $CMP_i$ depends.
We now define the following sets for computing the survivability of the dependent component $CMP_i$.
Based on the two coefficients, we compute the final survivability of $CMP_i$ through Eq. (15). In Eq. (15

C. Comprehensive Model for Multilayer Survivability Computation
We are now ready to extend the model to the multilayer computation of survivability. As mentioned at the beginning of Section V, we compute the survivability of the software-layer applications from the finalized survivability of the components, and then the survivability of the operation-layer functions from the survivability of the applications of each function. In other words, we propagate the survivability of the infrastructure-layer components to the software-layer applications, and then propagate the survivability of the applications to the operation-layer functions. This process follows exactly the propagation method of Section V-B. Fig. 8 shows the process.
In Fig. 8, we compute the survivability of each application with respect to the survivability of the underlying components on which it depends. This process is similar to the previous one for calculating the survivability of the dependent component $CMP_i$. Once the survivability of all applications has been computed, we use it to compute the survivability of the operation-layer functions in a similar way; that is, we propagate from the software layer to the operation layer.
One important point in Fig. 8 is that, for an application to operate healthily, it is enough that each underlying component performs only its essential functions. We can therefore merge the healthy and survive states of the components into a single operational state, as illustrated in the right portion of Fig. 8.
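The merging of the healthy and survive states can be expressed as a small helper. This is a minimal sketch of the state merge only, not of the propagation formulae themselves; the function name `merge_to_operational`, the data layout, and the numeric values are assumptions made for illustration.

```python
def merge_to_operational(states, tol=1e-9):
    """Map {component: (healthy, survive, fail)} to {component: (operational, fail)}.

    The operational probability is the sum of the healthy and survive
    probabilities, since in both states the component still delivers
    its essential functions.
    """
    merged = {}
    for name, (healthy, survive, fail) in states.items():
        if abs(healthy + survive + fail - 1.0) > tol:
            raise ValueError(f"probabilities of {name} do not sum to 1")
        merged[name] = (healthy + survive, fail)
    return merged


# Hypothetical tri-state results for two infrastructure components.
states = {"CMP1": (0.36, 0.36, 0.28), "CMP2": (0.90, 0.05, 0.05)}
print(merge_to_operational(states))
```

The check on the input triple guards against passing probabilities that were computed for different time points or rounded inconsistently.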

VI. SURVIVABILITY OF A CRISIS MANAGEMENT SYSTEM
To investigate the proposed model, we considered a crisis management system and modeled it, then applied the model to that system to verify our approach. Based on our studies, we extracted the general model of Fig. 9 for a common crisis management system. We carried out extensive calculations over all layers and components of the system to compute its survivability, but page limits prevent us from presenting all of them; interested readers can obtain them by email. To show the final results only, Table II presents the calculated survivability of the two critical processes OP1 and OP2. We are now ready to compute the survivability probabilities of the crisis management system as a whole. For this purpose, the total health probability of the crisis management structure is symbolized

This paper provides a general multilayer structure for computing system survivability that is extendable to all common organizational systems and operations. We designed a three-layer model that connects operational processes to application systems and application systems to the infrastructure layer. The dependencies among these layers were studied both vertically (interlayer) and horizontally (intralayer). In addition, a new conceptual model of survivability was provided based on the characteristics of the Markov model. This model was then used in a three-stage structure to achieve our goal. In the first stage, the survivability of an infrastructure-layer component was computed independently, regardless of any dependencies. In the second stage, the horizontal dependencies between infrastructure-layer components were introduced into the computations and the survivability was recomputed with those dependencies. In the final stage, the survivability computation model was extended to take into account the vertical dependencies of the upper layers, and the survivability of the application systems and, finally, of the system's operational processes was computed including these dependencies. Finally, by applying the complete model to an important and frequently encountered problem, the crisis management system, we computed the survivability of that system at the level of its critical and major processes and demonstrated the abilities of our model. Using this model enables managers and planners to detect the system weak points that cause the greatest loss of survivability and to protect and retain the critical functions of the system efficiently under crisis conditions.

Fig. 5. Combined Markov Model of Survivability Quantification of a Single Component.

Fig. 7. Survivability Propagation in Infrastructure Layer of Model Among Dependent Components.

Fig. 9. Model of the Given Crisis Management System.
operation continuation in the failure conditions of non-critical process as $\pi_S^{Overall}$ and the probability of failure of the total crisis management system as $\pi_F^{Overall}$. It is supposed that the computation processes performed for OP1 and OP2 are performed similarly for OP3 and OP4. Since we want to find operationally acceptable states for the processes,

TABLE I. DIFFERENT DEFINITIONS PROVIDED FOR THE SURVIVABILITY

TABLE II. COMPUTATION OF CRITICAL OPERATION LAYER