Multivariate Copula Modeling with Application in Software Project Management and Information Systems

This paper discusses the application of copulas in software project management and information systems. Successful software projects depend on accurate estimation of the software development schedule. In this research, three major risk factors and their impact on the software development schedule are considered. The software development schedule is calculated with the COCOMO-II model. Two models are each simulated 100000 times: model-I captures dependence among the risk factors with a T-copula, and model-II treats the risk factors as independent. The comparison of the two risk models revealed that model-II underestimates the software development schedule, while model-I evaluates the software schedule risk accurately. Therefore, it is necessary for software development experts to consider dependence among the various risk factors. The R package copula is employed to implement the algorithm for the multivariate T-copula. The multiplier goodness-of-fit test shows that the T-copula is a good choice for characterizing the dependence among the three risk factors.

Keywords: T-copula; COCOMO-II; software development schedule; risk analysis


I. INTRODUCTION
Software project management and information systems consist of multiple activities under the umbrella of software engineering [1]. They involve planning, scheduling, budgeting and managing the entire software development process. Each activity in project management consumes time, which makes up the software project schedule. Further, the entire software development life cycle depends on accurate estimation of the software project schedule. Besides estimating the project schedule, it is the responsibility of a project manager to identify risk factors that result in project delays or failure. Therefore, it is necessary to assess the software project schedule accurately. A plethora of literature is available on risk factors that result in schedule overrun or project failure. For further detail, see [2] [3] [4].
Generally, software project managers just estimate the project schedule and completely or partially ignore the impact of risks on the estimated schedule; even if a project manager considers some risk factors, they are assumed to be independent. Positive and negative dependence may affect software risk severely. For example, a project manager may estimate the software schedule at 20 months while the project actually ends in 26 months. It is also possible that one risk causes another: for instance, an increase in customer requirements during development may cause a schedule overrun, which in turn may lead to the loss of key employees. Further, project delays have a negative impact on customer satisfaction, which damages the reputation of an organization. This research paper analyzes the software project schedule and considers dependence among risk factors by means of copulas.
Copulas are considered in this research because most of the literature on the theory of copulas discusses applications in econometrics, finance and insurance [5], whereas very few articles discuss copulas in software project management or information systems. See [6] [7] [8] as examples and the references therein.
The remainder of the paper is organized as follows: Section II gives an overview of the theory of copulas. Section III discusses the software schedule and associated risks. Section IV discusses the application of the copula model for software houses based in Karachi, and Section V presents the overall conclusion.

II. THEORY OF COPULAS
The introduction of copulas is associated with the seminal work of Sklar [9]. He coined the word "copula" for a function that links a multivariate joint distribution to its marginal distributions. In this section, we present a brief overview of the basic theory of copulas for higher dimensions. For further details and proofs about copulas and their historical development, please see [10] [11] [12] [13] and the references therein.

A. Multidimensional Copulas
A d-dimensional copula, or d-copula, is a function $C$ from $I^d$ to $I$, where $I = [0, 1]$, that satisfies the following conditions:
(i) For every $m = (m_1, m_2, \dots, m_d)$ in $I^d$, $C(m_1, m_2, \dots, m_d) = 0$ if at least one coordinate of $m$ is equal to zero.
(ii) If all coordinates of $m$ except $m_k$ are equal to one, then $C(m) = m_k$, for every $k \in \{1, 2, \dots, d\}$.
(iii) For every $a$ and $b$ in $I^d$ with $a_k \le b_k$ for all $k$, the $C$-volume of the rectangle $[a, b]$ is non-negative: $V_C([a, b]) \ge 0$.
www.ijacsa.thesai.org
All of the above properties are multivariate versions of the properties of a bivariate copula. Property (i) is the grounded condition, property (ii) defines the copula margins when the other $(d - 1)$ arguments are equal to one, and property (iii) is stated in terms of the $C$-volume of the rectangle $[a, b]$ in any dimension $d$. Further, it is easy to show that any convex linear combination of copulas is a copula, i.e. $\sum_i \alpha_i C_i$ is a copula for all $\alpha_i > 0$ with $\sum_i \alpha_i = 1$ [13]. Nelsen and Joe discuss some other important properties of copulas, which we state as Theorems 1, 2 and 3 below [11] [13].
Theorem 1
For every d-copula $C$ and every $m$ in $I^d$, the Frechet-Hoeffding bound inequality is given by:
$$W^d(m) \le C(m) \le M^d(m)$$
where $W^d$ and $M^d$ represent the Frechet-Hoeffding lower and upper bounds, defined as $W^d(m) = \max\left(\sum_{i=1}^{d} m_i - d + 1,\ 0\right)$ and $M^d(m) = \min(m_1, m_2, \dots, m_d)$, respectively. Notice that, for $d > 2$, the lower bound is not a copula. For further details, see Theorem 3.6 in [13]. For $d = 2$, the graphs of $W^2$ and $M^2$ are given in Figure 1 and Figure 2, respectively.
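The inequality of Theorem 1 can be checked numerically. The following is a minimal sketch, not part of the original study, that evaluates the standard bounds $\max(\sum_i m_i - d + 1, 0)$ and $\min(m_1, \dots, m_d)$ on a regular grid and confirms that the product (independence) copula lies between them. The function names and the grid resolution are illustrative assumptions.

```python
import itertools


def lower_bound(m):
    """Frechet-Hoeffding lower bound: max(sum(m) - d + 1, 0)."""
    return max(sum(m) - len(m) + 1.0, 0.0)


def upper_bound(m):
    """Frechet-Hoeffding upper bound: min(m_1, ..., m_d)."""
    return min(m)


def independence_copula(m):
    """Product copula: C(m) = m_1 * ... * m_d."""
    result = 1.0
    for value in m:
        result *= value
    return result


def bounds_hold(copula, d, steps=10, tol=1e-12):
    """Check lower <= C <= upper on a regular grid of [0, 1]^d."""
    grid = [k / steps for k in range(steps + 1)]
    for m in itertools.product(grid, repeat=d):
        c = copula(m)
        if c < lower_bound(m) - tol or c > upper_bound(m) + tol:
            return False
    return True


if __name__ == "__main__":
    print(bounds_hold(independence_copula, d=3))
```

A grid check of this kind is only a sanity test; it does not replace the proof referenced in [13].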

Theorem 2
For all $m$ in $I^d$, the independence copula is given by:
$$C_d(m) = \prod_{i=1}^{d} m_i$$
For all $d \ge 2$, the continuous random variables $X_1, X_2, \dots, X_d$ are independent if and only if their d-copula is $C_d(m)$. For $d = 2$, the graph of the independence copula is given in Figure 3. As an illustration, consider the following extended FGM copula for $d = 3$:
$$C(m_1, m_2, m_3) = m_1 m_2 m_3 \left[1 + \sum_{1 \le i < j \le 3} \alpha_{ij} (1 - m_i)(1 - m_j)\right] \quad (4)$$
It is easy to show that the above three-dimensional FGM copula satisfies the basic requirements of a d-copula. Hence (4) is a copula, and it reduces to the independence copula of Theorem 2 for $d = 3$ if the dependence parameters $\alpha_{ij}$ are all equal to zero. For further details about the d-dimensional extended FGM copula, see Drouet and Kotz [14].
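The behaviour of the three-dimensional FGM copula can be illustrated with a short sketch. It assumes the pairwise form $m_1 m_2 m_3 [1 + \sum_{i<j} \alpha_{ij}(1 - m_i)(1 - m_j)]$; the parameterization in [14] may include a further three-way term, so the function below is an assumption for illustration rather than the exact family used in the paper. The parameter values in the example calls are arbitrary.

```python
def fgm3(m1, m2, m3, a12=0.0, a13=0.0, a23=0.0):
    """Assumed pairwise extended FGM copula for d = 3."""
    product = m1 * m2 * m3
    perturbation = (
        a12 * (1.0 - m1) * (1.0 - m2)
        + a13 * (1.0 - m1) * (1.0 - m3)
        + a23 * (1.0 - m2) * (1.0 - m3)
    )
    return product * (1.0 + perturbation)


if __name__ == "__main__":
    # All alphas zero: the value equals the product m1 * m2 * m3.
    print(fgm3(0.3, 0.5, 0.8))
    # Grounded: one argument equal to zero gives zero.
    print(fgm3(0.0, 0.5, 0.8, a12=0.4))
    # Margin: two arguments equal to one return the remaining argument.
    print(fgm3(1.0, 1.0, 0.8, a12=0.4, a13=-0.2, a23=0.1))
```

The three calls correspond to the independence case of Theorem 2 and to properties (i) and (ii) of a d-copula.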

B. Sklar's Theorem
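Copulas are important because of Sklar's theorem [9]. According to this theorem, any multivariate joint distribution can be represented in terms of its marginal distributions and a copula. Let $X_1, X_2, \dots, X_d$ be continuous random variables with joint distribution function $J$ and univariate marginal distributions $F_i(x_i) = P(X_i \le x_i)$, $i \in \{1, 2, \dots, d\}$. Then there exists a d-copula $C_d$ such that
$$J(x_1, x_2, \dots, x_d) = C_d\big(F_1(x_1), F_2(x_2), \dots, F_d(x_d)\big) \quad (5)$$
Let $J$, $C_d$ and $F_i$ be as in (5), and let $F_1^{(-1)}, F_2^{(-1)}, \dots, F_d^{(-1)}$ be the quasi-inverses of $F_1, F_2, \dots, F_d$, respectively. Then, for every $m$ in $I^d$,
$$C_d(m_1, m_2, \dots, m_d) = J\big(F_1^{(-1)}(m_1), F_2^{(-1)}(m_2), \dots, F_d^{(-1)}(m_d)\big)$$
For the sake of simplicity, let $F_1, F_2, \dots, F_d$ be continuous and differentiable distribution functions. Then the density function corresponding to (5) is
$$j(x_1, x_2, \dots, x_d) = c\big(F_1(x_1), F_2(x_2), \dots, F_d(x_d)\big) \prod_{i=1}^{d} j_i(x_i)$$
where $j_i(x_i)$ is the marginal density of $F_i(x_i)$, $i \in \{1, 2, \dots, d\}$, and
$$c(m_1, m_2, \dots, m_d) = \frac{\partial^d C_d(m_1, m_2, \dots, m_d)}{\partial m_1 \, \partial m_2 \cdots \partial m_d}$$
Because the margins are continuous, the copula function $C_d$ is uniquely determined, and its density can also be represented as
$$c(m_1, m_2, \dots, m_d) = \frac{j\big(F_1^{(-1)}(m_1), \dots, F_d^{(-1)}(m_d)\big)}{\prod_{i=1}^{d} j_i\big(F_i^{(-1)}(m_i)\big)}$$
where $m_i = F_i(x_i)$. The joint density function is therefore the copula density $c$ multiplied by the product of the marginal densities.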
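As a worked illustration of the density decomposition, consider the bivariate FGM copula with a single dependence parameter $\alpha$. This example is added for clarity and is not taken from the original study; it uses $j_1$ and $j_2$ for the marginal densities of $F_1$ and $F_2$.

```latex
\begin{align*}
C(m_1, m_2) &= m_1 m_2 \left[ 1 + \alpha (1 - m_1)(1 - m_2) \right]
             = m_1 m_2 + \alpha \, m_1 (1 - m_1) \, m_2 (1 - m_2), \\
c(m_1, m_2) &= \frac{\partial^2 C(m_1, m_2)}{\partial m_1 \, \partial m_2}
             = 1 + \alpha (1 - 2 m_1)(1 - 2 m_2), \\
j(x_1, x_2) &= \left[ 1 + \alpha \big(1 - 2 F_1(x_1)\big)\big(1 - 2 F_2(x_2)\big) \right] j_1(x_1) \, j_2(x_2).
\end{align*}
```

Setting $\alpha = 0$ gives $c \equiv 1$, so the joint density reduces to the product of the marginal densities, which is the independence case.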

A. Project Scheduling Risk
Risk relates to future events and has two characteristics: uncertainty and loss [1] [15]. If a risk occurs, it has a positive or negative impact on the project's objectives. In all, risk has two dimensions: the probability of occurrence of an event and its impact. If a risk associated with the software project schedule occurs, the estimated schedule exceeds its deadline, which results in financial losses and a bad reputation for the organization. This research explores the relationship between risk and its impact on the software project schedule. Scheduling risk is the probability of one or more events that, if they occur, have a positive or negative impact on the software development duration [16].
There are many uncertain risk factors that affect the project schedule severely. However, for this research, three major risk factors are considered that every project manager must face [4]. These three major risks are defined below.

a) Imprecise measurement of software effort:
Software effort is defined as the number of working hours spent on the project. In this research, software effort is expressed as:
Incorrect measurement of software effort results in project delays. Incorrect estimation of effort is a consequence of an inexperienced manager or inadequate knowledge of estimation tools.

b) Loss of Employees during project:
This is the usual turnover rate of employees during projects, expressed as a percentage. It includes resignation, death, medical leave, retirement or transfer of employees during the project.

c) Change of Customer requirements:
Change of customer requirements covers any increase or decrease in customer requirements during the software development period and is expressed as a percentage.

B. Risk Assessment Model
Many risk assessment techniques exist. We assess software development risk with a risk assessment model. For this research, the following schedule risk model is considered:
where $R_1$ indicates the impact of imprecise estimation of software effort, $R_2$ indicates the impact of the loss of employees during the project, and $R_3$ indicates the impact of a change of customer requirements during the software development period on the project schedule. As can be seen, the model is multiplicative in nature. For further details about this model, see [16] [17]. The probability distributions for these three risk factors are derived using expert data from various software houses based in Karachi. Project cost and schedule are estimated with the COCOMO-II model [18].
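To make the multiplicative structure concrete, the sketch below assumes, purely for illustration, that each risk factor is a percentage that scales a base schedule by a factor of $(1 + R_i / 100)$. The exact functional form is the one given in [16] [17]; the function name, the base schedule and the risk values used here are hypothetical.

```python
def risk_adjusted_schedule(base_months, r1_pct, r2_pct, r3_pct):
    """Hypothetical multiplicative model: base * prod(1 + R_i / 100).

    This form is an assumption for illustration, not the model of [16] [17].
    """
    adjusted = base_months
    for risk_pct in (r1_pct, r2_pct, r3_pct):
        adjusted *= 1.0 + risk_pct / 100.0
    return adjusted


if __name__ == "__main__":
    # Illustrative values only: a 20-month base with 10%, 5% and 8% impacts.
    multiplicative = risk_adjusted_schedule(20.0, 10.0, 5.0, 8.0)
    additive = 20.0 * (1.0 + (10.0 + 5.0 + 8.0) / 100.0)
    print(multiplicative, additive)
```

Under this assumed form the multiplicative result exceeds the additive one whenever all impacts are positive, because the cross terms between risk factors are retained.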

C. COCOMO -II
Boehm et al. [18] proposed the COCOMO-II model, which uses three stages to estimate software project cost, effort and schedule. For the early stages of software project development, the Application Composition model is used. When information packages, software architecture and infrastructure are finalized, the Early Design model is used. Finally, the Post-Architecture model is used during software development.
We calculated the schedule of a new software project of 400000 SLOC using COCOMO-II. With the model variables set to the desired levels, COCOMO-II estimated the new software development schedule at between 17 and 40 months. Therefore, the new software development project can be completed in at least 17 months and at most 40 months, while the actual schedule for this project was 30 months. These minimum and maximum durations represent the best and worst cases. The levels of the COCOMO-II model variables for the new software development schedule were set by consulting software development experts.
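The way COCOMO-II turns size into a schedule can be sketched as follows. The sketch assumes the commonly quoted COCOMO II.2000 calibration constants (A = 2.94, B = 0.91, C = 3.67, D = 0.28) and a nominal scale-factor sum of 18.97; these values are quoted from memory and should be checked against [18]. The schedule-compression (SCED) adjustment is omitted, and the settings do not reproduce the expert-chosen levels behind the 17 to 40 month range reported above.

```python
# Assumed COCOMO II.2000 calibration constants; verify against [18].
EFFORT_A = 2.94
EFFORT_B = 0.91
SCHED_C = 3.67
SCHED_D = 0.28


def effort_person_months(ksloc, scale_factor_sum, effort_multipliers=()):
    """Effort PM = A * Size^E * prod(EM), with E = B + 0.01 * sum(SF)."""
    exponent = EFFORT_B + 0.01 * scale_factor_sum
    effort = EFFORT_A * ksloc ** exponent
    for multiplier in effort_multipliers:
        effort *= multiplier
    return effort, exponent


def schedule_months(ksloc, scale_factor_sum, effort_multipliers=()):
    """Schedule TDEV = C * PM^(D + 0.2 * (E - B)); SCED adjustment omitted."""
    effort, exponent = effort_person_months(
        ksloc, scale_factor_sum, effort_multipliers
    )
    return SCHED_C * effort ** (SCHED_D + 0.2 * (exponent - EFFORT_B))


if __name__ == "__main__":
    # 400 KSLOC with an assumed nominal scale-factor sum and no multipliers.
    print(schedule_months(400.0, 18.97))
```

Varying the scale-factor sum and the effort multipliers between their most favourable and least favourable settings is what produces a best-case and worst-case schedule.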

D. T -Copula Method to Model Dependence
Let the d-dimensional random vector $Y = (Y_1, Y_2, \dots, Y_d)^T$ have a d-variate t-distribution with $\nu$ degrees of freedom and mean vector $\mu$.
The contour plot shows moderate dependencies between the variables. This is the beauty of copulas: even if the correlation is zero, the marginal distributions are still linked to the joint distribution through the copula [9].
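Sampling from a t-copula follows directly from its standard definition $C(m) = t_{\nu, P}\big(t_\nu^{-1}(m_1), \dots, t_\nu^{-1}(m_d)\big)$: draw a multivariate t vector and push each coordinate through the univariate t distribution function. The sketch below uses only the Python standard library and is not the R copula implementation used in the study. The correlation matrix, the degrees of freedom and the numerical integration used for the t distribution function are illustrative assumptions.

```python
import math
import random


def t_cdf(x, nu):
    """Student-t CDF via Simpson integration of the density on [0, |x|]."""
    norm = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    upper = abs(x)
    intervals = 2 * max(100, int(20 * upper))  # even count, step <= 0.025
    h = upper / intervals

    def pdf(t):
        return norm * (1.0 + t * t / nu) ** (-(nu + 1) / 2)

    total = pdf(0.0) + pdf(upper)
    for k in range(1, intervals):
        total += (4 if k % 2 else 2) * pdf(k * h)
    area = min(total * h / 3.0, 0.5)  # guard against rounding above 0.5
    return 0.5 + math.copysign(area, x)


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def sample_t_copula(corr, nu, n, seed=0):
    """Return n draws of dependent uniforms from a t-copula."""
    rng = random.Random(seed)
    lower = cholesky(corr)
    d = len(corr)
    draws = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        correlated = [
            sum(lower[i][k] * z[k] for k in range(i + 1)) for i in range(d)
        ]
        w = rng.gammavariate(nu / 2.0, 2.0)  # chi-square with nu d.o.f.
        scale = math.sqrt(nu / w)
        draws.append([t_cdf(value * scale, nu) for value in correlated])
    return draws


if __name__ == "__main__":
    # Hypothetical correlation matrix and degrees of freedom.
    corr = [[1.0, 0.5, 0.3], [0.5, 1.0, 0.4], [0.3, 0.4, 1.0]]
    for row in sample_t_copula(corr, nu=4, n=5, seed=1):
        print(row)
```

Each row contains three dependent values in $[0, 1]$, one per risk factor, ready to be mapped to the fitted marginal distributions.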

IV. APPLICATION OF COPULAS IN SOFTWARE PROJECT MANAGEMENT
In this research, two risk models are considered for the new software project schedule. Model-I considers dependence among the risk factors, and model-II assumes that the risks are independent. For model-I, we used a multivariate T-copula to model dependence among the risk factors. The Monte Carlo method is employed to simulate the two models. The simulation for each model was executed 100000 times.

A. Distributions of Three Risk Factors
The results of copulas are useful only if the best-fitting distributions have been chosen for the marginals. These marginal distributions are for software effort, loss of employees and change of customer requirements during the project. The marginal distributions for the three risks are derived using data from several software houses based in Karachi. Several statistical hypothesis tests and graphs are employed to assess goodness of fit for the three marginal distributions. The results of the goodness-of-fit tests for a sample size of 200 are provided in Tables I, II and III. All goodness-of-fit tests listed in Table III provide high p-values at the 1% level of significance. Therefore, the risk distribution for change of customer requirements conforms to a Weibull distribution with parameters (1.56949%, 27.27296%).
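Once the marginals are fitted, uniform values drawn from a copula are mapped to risk factors through the inverse distribution functions. The sketch below assumes that the reported normal parameters are (mean, standard deviation) and that the reported Weibull parameters are (shape, scale); this reading of the parameter order is an assumption. The helper names are hypothetical.

```python
import math
from statistics import NormalDist

# Assumed reading of the fitted parameters: normal (mean, sd), Weibull (shape, scale).
EFFORT_DIST = NormalDist(mu=40.49525, sigma=20.79036)
LOSS_SHAPE, LOSS_SCALE = 1.478794, 27.747935
CHANGE_SHAPE, CHANGE_SCALE = 1.56949, 27.27296


def weibull_quantile(u, shape, scale):
    """Inverse Weibull CDF: scale * (-ln(1 - u))^(1 / shape), for 0 < u < 1."""
    return scale * (-math.log(1.0 - u)) ** (1.0 / shape)


def to_risk_factors(u1, u2, u3):
    """Map uniforms strictly inside (0, 1) to the three risk factors in percent."""
    r1 = EFFORT_DIST.inv_cdf(u1)
    r2 = weibull_quantile(u2, LOSS_SHAPE, LOSS_SCALE)
    r3 = weibull_quantile(u3, CHANGE_SHAPE, CHANGE_SCALE)
    return r1, r2, r3


if __name__ == "__main__":
    # Medians of the three assumed marginals.
    print(to_risk_factors(0.5, 0.5, 0.5))
```

Feeding independent uniforms into `to_risk_factors` corresponds to the independence assumption of model-II, while feeding copula draws preserves the dependence structure of model-I.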

B. Simulation Results of the Two Models
As described above, two risk models are considered. Model-I considers dependence among the three risk factors through a T-copula, and model-II treats the risk factors as independent. Both models are simulated 100000 times, and their histograms are presented in Fig. 5 and Fig. 6, respectively. In Fig. 5, the left y-axis represents the probability density function and the right y-axis represents the cumulative distribution function for the simulated model-I. According to the simulated histogram, the new software project schedule can vary from 29 months to 31 months. There is a 100% chance that the new software project can be completed in almost 30 months, while there is approximately a 38% chance that it can be completed in almost 29 months.
The multiplier goodness-of-fit test [21] is applied to the chosen T-copula. The test gives a test statistic of 0.036957 with a p-value of 0.1004. Since the p-value exceeds the 1% level of significance, the chosen T-copula is not rejected and is appropriate for characterizing the dependence among the three risk factors. The simulation histogram for model-II is shown in Fig. 6. The simulated model-II shows that the new project schedule can vary from 27 months to 29 months. There is a 100% chance that the new project can be completed in 28 months and less than a 5% chance that it can be completed in 27 months.
The comparison of the two simulated risk models revealed that the new software schedule is underestimated when dependence among the three risk factors is not considered. Further comparison of the simulated results revealed that there is an almost 100% chance for the new project to be completed in 30 months under risk model-I and a 100% chance for it to be completed in 28 months under risk model-II. There is an almost 40% chance for the new software project to be completed in 29 months under risk model-I and almost 5% under risk model-II. The original duration of the new project is 30 months. This means that model-I, which considers dependence among risk factors through a T-copula, evaluated the software project duration accurately.

V. CONCLUSIONS
Project delays or failures are routine in many software houses across Pakistan. In this research, we considered three major risk factors that can negatively impact the estimated project schedule. The risk factors are evaluated with two models. Model-I assumes dependence among the risk factors through a multivariate T-copula, and model-II assumes independence among the risk factors. Both models were implemented for some software houses based in Karachi, and the analysis revealed that model-I, which considers dependence among risk factors through a T-copula, evaluated the project schedule accurately. The multiplier goodness-of-fit test showed that the chosen T-copula is appropriate for characterizing the dependence among the three risk factors. It is concluded that if a software manager does not consider dependence among risk factors, then the software project schedule may be underestimated. Schedule overruns result in high budgeting cost, dissatisfaction of customers and sometimes failure of the software project. Therefore copulas are important for characterization

TABLE I. PROBABILITY DISTRIBUTION FOR SOFTWARE EFFORT (R1), H0: R1 CONFORMS TO NORMAL DISTRIBUTION
All goodness-of-fit tests listed in Table I provide high p-values at the 1% level of significance. Therefore, the risk distribution for imprecise measurement of software effort conforms to a normal distribution with parameters (40.49525%, 20.79036%).

TABLE II. PROBABILITY DISTRIBUTION FOR LOSS OF EMPLOYEES (R2), H0: R2 CONFORMS TO WEIBULL DISTRIBUTION
All goodness-of-fit tests listed in Table II provide high p-values at the 1% level of significance. Therefore, the risk distribution for loss of employees during the project conforms to a Weibull distribution with parameters (1.478794%, 27.747935%).

TABLE III. PROBABILITY DISTRIBUTION FOR CHANGE OF REQUIREMENTS (R3), H0: R3 CONFORMS TO WEIBULL DISTRIBUTION