A Copula Statistic for Measuring Nonlinear Dependence with Application to Feature Selection in Machine Learning

Feature selection in machine learning aims to find the best subset of input variables, one that reduces the computational requirements and improves predictor performance. This paper introduces a new index based on empirical copulas, termed the Copula Statistic (CoS), to assess the strength of statistical dependence and to test for statistical independence. It is shown that the resulting test exhibits higher statistical power than tests based on other indices. Finally, the CoS is applied to feature selection problems in machine learning, which demonstrates its good performance. Keywords—Copula; multivariate dependence; nonlinear systems; feature selection; machine learning


INTRODUCTION
Measures of statistical dependence among random variables and signals are paramount in many scientific applications: engineering, signal processing, finance, biology and machine learning, to cite a few. They allow one to find clusters of data points and signals, test for independence to make decisions, and explore causal relationships. The classic measure of dependence is the correlation coefficient, introduced in 1895 by Karl Pearson. Since it relies on moments, it assumes linear statistical dependence. However, in biology, ecology, finance, and other fields, applications involving nonlinear multivariate dependence prevail. For such applications, the correlation coefficient is unreliable. Hence, researchers have put forward many proposals to address this deficiency [1]-[5]. Reshef et al. [6], [7] introduced the Maximal Information Coefficient (MIC) and later the Total Information Coefficient (TIC), Lopez-Paz et al. [8] proposed the Randomized Dependence Coefficient (RDC), and Ding et al. [9], [10] put forth the Copula Correlation Coefficient (Ccor). Additionally, Székely et al. [11] proposed the distance correlation (dCor). These metrics are able to measure monotonic and non-monotonic dependencies between random variables, but each has strengths and shortcomings [12]-[18]. Feature selection in machine learning is a typical proving ground for appraising the quality and reliability of a dependence index: the task is to find the best subset of input variables, one that reduces the computational requirements and feeds the predictor algorithm for optimal performance [19], [20].
In this paper, a new index based on copulas, termed the Copula Statistic (CoS), is introduced for measuring the strength of nonlinear statistical dependence and for testing for statistical independence. The CoS ranges from zero to one and attains its lower and upper limits for the independence and the functional dependence cases, respectively. Monte Carlo simulations are carried out to estimate bias and standard deviation curves of the CoS and to assess its power when testing for independence. The simulations reveal that for large sample sizes, the CoS is approximately normal and approaches Pearson's ρP for the Gaussian copula and Spearman's ρS for many copulas. The CoS is shown to exhibit strong statistical power for various functional dependencies as compared to many other indices. Finally, the CoS is applied to a feature selection problem to unveil bivariate dependence.
The paper is organized as follows. Section II proves two new and essential theorems on copulas used to derive the CoS index. Section III introduces a relative distance function and proves several of its properties. Section IV defines the CoS and provides an algorithm that implements it. Section V investigates the statistical properties of the CoS and treats the case of bivariate dependence. Section VI compares the performance of the CoS with the dCor, RDC, Ccor, and MICe in measuring bivariate functional and non-functional dependencies on synthetic datasets. It also shows how the CoS can unveil statistical dependence in a real breast tumor dataset and carries out an in-depth analysis to find the best feature subset for this problem.

II. BIVARIATE COPULA
In the following, attention is restricted to two-dimensional copulas to develop a statistical index, the CoS, for the bivariate dependence case. To define the CoS of two continuous random variables X and Y with copula C(u, v), three definitions of bivariate dependence are provided, from weaker to stronger versions, as introduced by Lehmann [21]. Then, three theorems are stated which help build the foundation for the CoS. Definition 1: Two random variables, X and Y, are said to be concordant (or discordant) if they tend to simultaneously take large (or small) values. A more formal definition is as follows. Let X and Y be two random variables taking two pairs of values, (x_i, y_i) and (x_j, y_j). X and Y are said to be concordant if (x_i - x_j)(y_i - y_j) > 0; they are said to be discordant if the inequality is reversed.
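The formal criterion of Definition 1 can be checked directly on a sample. The small helper below (a sketch of ours, not part of the paper) counts concordant and discordant pairs:

```python
def concordance_counts(xs, ys):
    # Count concordant and discordant pairs per the formal criterion:
    # (x_i - x_j)(y_i - y_j) > 0 means concordant, < 0 discordant.
    conc = disc = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                conc += 1
            elif s < 0:
                disc += 1
    return conc, disc
```

For a strictly increasing relationship every pair is concordant; for a strictly decreasing one every pair is discordant.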
Definition 3: Two random variables, X and Y, are said to be comonotonic (respectively countermonotonic) if Y = f(X) almost surely and f(.) is an increasing (respectively a decreasing) function.
In short, two random variables are monotonic if they are either comonotonic or countermonotonic.
Theorem 1 (Fréchet [13]): Let X and Y be two continuous random variables. Then, a) X and Y are comonotonic if and only if the associated copula is equal to its Fréchet-Hoeffding upper bound, that is, C(u,v) = M(u,v) = min(u, v); b) X and Y are countermonotonic if and only if the associated copula is equal to its Fréchet-Hoeffding lower bound, that is, C(u,v) = W(u,v) = max(u + v - 1, 0); c) X and Y are independent if and only if the associated copula is equal to the product copula, that is, C(u,v) = Π(u,v) = uv.
In the following theorems and corollaries, it is assumed that X and Y are continuous random variables related via a function f(⋅), that is, Y = f(X), where f(⋅) is continuous and differentiable over the range of X.
Theorem 2: Let X and Y be two continuous random variables such that Y = f(X) almost surely, and let C(u,v) be the copula value for the pair (x,y). The function f(⋅) has a global maximum at (x_1, y_max) with copula value C(u_1, v_1), or a global minimum at (x_2, y_min) with copula value C(u_2, v_2), if and only if conditions (1) and (2) hold, respectively. The proof of Theorem 2 is given in the appendix. For a general definition of the copula, the reader is referred to Nelsen [13]. The proof of Corollary 1 directly follows from Theorem 2. This corollary is demonstrated in Fig. 1, which displays the graph of the projections on the (u, C(u,v)) plane of the empirical copula C(u,v) associated with a pair (X, Y), where X is uniformly distributed over [-1, 1] and Y = sin(2πX). It is observed that at each of the four optima of the sine function, C(u,v) = M(u,v) = W(u,v) = Π(u,v). Theorem 3: Let X and Y be two continuous random variables such that Y = f(X) almost surely, where f(⋅) has a single optimum, and let C(u,v) be the copula value for the pair (x,y). Then, C(u,v) = M(u,v) if and only if df(x)/dx ≥ 0, and C(u,v) = W(u,v) if and only if df(x)/dx ≤ 0. The proof of Theorem 3 is provided in the appendix. Theorem 3 is illustrated in Fig. 2.

III. THE RELATIVE DISTANCE FUNCTION
A measure of the proximity of the copula to its upper or lower bound, relative to the product copula Π, is defined and its properties are investigated. Definition 5: The relative distance function, λ(C(u,v)) ∈ [0,1], is defined as the distance of C(u,v) from the product copula Π(u,v), normalized by the distance from Π(u,v) to the Fréchet-Hoeffding bound lying on the same side. A graphical illustration of the relative distance is shown in Fig. 3. Theorem 4: λ(C(u,v)) satisfies properties a) through d); in particular, λ(C(u,v)) = 1 at the global optimal points of f(⋅).

Proof: Property a) follows from Definition 5 and (3), while properties b), c) and d) follow from Definition 5 and Theorems 1 and 2.
Corollary 2: If Y = f(X) almost surely, where f(⋅) has a single optimum, then λ(C(u,v)) = 1 for all (u,v) ∈ I². The question that now arises is the following: is λ(C(u,v)) = 1 for all (u,v) ∈ I² when there is a functional dependence with multiple optima, be they global or local? The answer is given by Theorems 5 and 6 below. The proof of Theorem 6 is provided in the appendix.
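Under one assumed reading of the relative distance (our reconstruction, since the displayed equation did not survive; the function name is illustrative), λ measures how far C(u,v) sits from the product copula Π, normalized by the distance from Π to the Fréchet-Hoeffding bound on the same side:

```python
def rel_distance(C, u, v):
    # Assumed form of the relative distance: distance of C(u,v) from the
    # product copula, relative to the bound on the same side
    # (M above Pi, W below Pi). Degenerate points default to 1.
    pi = u * v
    m = min(u, v)
    w = max(u + v - 1.0, 0.0)
    if C >= pi:
        return (C - pi) / (m - pi) if m > pi else 1.0
    return (pi - C) / (pi - w) if pi > w else 1.0

# lambda = 1 at the Fréchet-Hoeffding bounds and 0 at the product copula.
assert rel_distance(min(0.3, 0.6), 0.3, 0.6) == 1.0
assert rel_distance(0.3 * 0.6, 0.3, 0.6) == 0.0
```

This reading is consistent with Theorem 4: it reaches 1 at the bounds (monotonic dependence) and 0 under independence.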

IV. THE COPULA STATISTIC
The empirical copula is first defined; then the copula statistic is introduced, and finally an algorithm that implements it is provided. One possible definition of the CoS is the mean of λ(C(u,v)) over the unit square. However, according to Theorems 5 and 6, such a CoS would remain below 1 for functional dependence with multiple optima, which is not a desirable property. This prompts a better definition of the CoS based on the empirical copula, as explained next.

A. The Empirical Copula
Let {(x_i, y_i), i = 1,…, n, n ≥ 2} be a two-dimensional data set of size n drawn from a continuous bivariate joint distribution function H(x, y). Let R_xi and R_yi be the ranks of x_i and y_i, respectively. Deheuvels [22] defines the associated empirical copula as (3): C_n(u, v) = (1/n) Σ_{i=1}^{n} 1(R_xi/n ≤ u, R_yi/n ≤ v). The empirical relative distance, λ(C_n(u,v)), satisfies Definition 5 with C(u,v) replaced by the empirical copula given by (3).
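Deheuvels' estimator depends only on the ranks of the two samples. A minimal sketch (helper names are ours, not the paper's):

```python
def ranks(xs):
    # rank of each observation, 1 = smallest (ties broken by position)
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def empirical_copula(xs, ys):
    # C_n(u, v) = (1/n) * #{i : R_xi/n <= u and R_yi/n <= v}
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    def C_n(u, v):
        return sum(1 for i in range(n)
                   if rx[i] / n <= u and ry[i] / n <= v) / n
    return C_n

# For comonotonic data, C_n approaches M(u, v) = min(u, v).
C = empirical_copula([1, 2, 3, 4], [10, 20, 30, 40])
assert C(0.5, 0.5) == 0.5 and C(1.0, 0.25) == 0.25
```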

B. Defining the CoS Statistic for Bivariate Dependence
Let X and Y be two continuous random variables with copula C(u,v). Consider the ordered sequence,

C. Algorithmic implementation of the Copula Statistic
Given a two-dimensional data sample of size n, {(x_j, y_j), j = 1,…, n, n ≥ 2}, the algorithm that calculates the CoS consists of the following steps: 1) Compute the empirical copula C_n given by (3); 2) Order the x_j's to obtain the ordered sample x_(1) ≤ … ≤ x_(n), where (j) denotes the rank of x_j; 3) Determine the domains D_i, i = 1, ... , m, where each D_i is a u-interval associated with a non-decreasing or non-increasing sequence of C_n(u_(j), v_p), j = 1, … , n.
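Step 3 is the delicate part of the algorithm. One possible implementation (our reading of the text, not the authors' code) splits the rank sequence of Y, visited in increasing-x order, into maximal monotone runs, letting a turning point belong to both adjacent domains, as the definition of the CoS prescribes:

```python
def monotone_domains(rx, ry):
    # rx, ry: integer ranks of the x- and y-samples.
    # Returns index ranges (start, end) of maximal monotone runs of the
    # y-ranks visited in increasing-x order; a turning point is shared
    # by the two adjacent domains.
    order = sorted(range(len(rx)), key=lambda i: rx[i])
    seq = [ry[i] for i in order]
    domains, start, sign = [], 0, 0
    for j in range(1, len(seq)):
        step = 1 if seq[j] > seq[j - 1] else -1
        if sign == 0:
            sign = step
        elif step != sign:
            domains.append((start, j - 1))
            start, sign = j - 1, step
    domains.append((start, len(seq) - 1))
    return domains

# A rise-then-fall shape (e.g. half a sine period) yields two domains
# sharing the turning point.
assert monotone_domains([1, 2, 3, 4, 5], [1, 3, 5, 4, 2]) == [(0, 2), (2, 4)]
```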

V. STATISTICAL PROPERTIES OF THE COS
The finite-sample bias of the CoS is analyzed for the independence case, then a statistical test of bivariate independence is developed.

1) Finite-Sample Bias of the CoS
Table 1 displays the sample means and the sample standard deviations of the CoS for independent random samples generated from three copula families. As observed, the CoS has a bias for small to medium sample sizes. Fig. 4(a) shows a bias curve given by μ = 8.05 n^(-0.74), fitted to 19 mean bias values for Gauss(0) using the least-squares method. It is observed that the CoS bias becomes negligible for a sample size larger than 500. Fig. 4(b) shows values taken by the sample standard deviation σ of the CoS for increasing sample size n for Gauss(0). A fitted curve is also displayed; it is expressed as σ = 2.99 n^(-0.81).
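Fitting a power law such as μ = 8.05 n^(−0.74) by least squares amounts to ordinary linear regression in log-log space, as in the pure-Python sketch below (no external libraries; the function name is ours):

```python
import math

def fit_power_law(ns, ys):
    # Fit y = a * n**b by least squares on (log n, log y).
    lx = [math.log(n) for n in ns]
    ly = [math.log(y) for y in ys]
    k = len(ns)
    mx, my = sum(lx) / k, sum(ly) / k
    b = (sum((x - mx) * (y - my) for x, y in zip(lx, ly))
         / sum((x - mx) ** 2 for x in lx))
    a = math.exp(my - b * mx)
    return a, b

# Exact power-law data is recovered up to floating-point error.
ns = [50, 100, 200, 500, 1000]
a, b = fit_power_law(ns, [8.05 * n ** -0.74 for n in ns])
assert abs(a - 8.05) < 1e-9 and abs(b + 0.74) < 1e-9
```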

2) Independence Test
One common practical problem is to test the independence of random variables. To this end, hypothesis testing can be applied to the CoS based on Corollary 3b). The goal is to test the null hypothesis, H0: the random variables are independent, against its alternative, H1. The CoS is standardized under H0 to get Z_n = (CoS − μ_n0)/σ_n0, where μ_n0 and σ_n0 are the sample mean and the sample standard deviation of the CoS, respectively. Note that, as observed in Fig. 4(a) and (b), for a sample size n larger than 500, μ_n0 becomes negligible and σ_n0 is approximately equal to 0.01.
Hypothesis testing consists of choosing a threshold c at a given significance level under H0 and then applying the following decision rule: if |Z_n| ≤ c, accept H0; otherwise, accept H1. Table 2 displays Type-II errors of the statistical test applied to the CoS for Gauss(0) for sample sizes ranging from 100 to 3000. It is observed that Type-II errors decrease as the dependence parameter ρ_n increases for a given n, and sharply decrease with increasing n. For monotonic dependence, simulation results show that CoS = 1 for all n ≥ 2. For non-monotonic dependence, there is a bias that becomes negligible when the sample size is sufficiently large. As an illustrative example, Table 3 displays the sample mean, μ_n, and the sample standard deviation, σ_n, of the CoS for increasing sample size n for the sinusoidal dependence Y = sin(aX). It is observed that as the frequency of the sine function increases, the sample bias, 1 − μ_n, increases for constant n.
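The decision rule can be sketched in a few lines; μ_n0 ≈ 0 and σ_n0 ≈ 0.01 are the large-sample values quoted in the text, and c would come from the chosen significance level (the function name is ours):

```python
def accept_independence(cos_value, mu_n0, sigma_n0, c):
    # Standardize the CoS under H0 and accept H0 iff |Z| <= c.
    z = (cos_value - mu_n0) / sigma_n0
    return abs(z) <= c

# Large-sample values from the text: mu_n0 ~ 0, sigma_n0 ~ 0.01.
assert accept_independence(0.005, 0.0, 0.01, c=1.96)      # consistent with H0
assert not accept_independence(0.50, 0.0, 0.01, c=1.96)   # strong dependence
```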
Table 4 displays μ_n and σ_n of the CoS calculated for increasing n and for different degrees of dependence between two random variables following the Gaussian copula. It is interesting to note that for n ≥ 1000, the CoS is nearly equal to Pearson's ρP for the Gaussian copula and to Spearman's ρS for other copulas.

A. Synthetic Datasets
In this section, the performances of the CoS, dCor, RDC, Ccor, and MICe for various types of statistical dependence are compared. Székely et al. [11] define the distance correlation, dCor, between two random vectors X and Y with finite first moments as dCor(X,Y) = ν(X,Y)/√(ν(X,X) ν(Y,Y)), where ν²(X,Y) is the distance covariance. Lopez-Paz et al. [8] define the RDC as the largest canonical correlation between k randomly chosen nonlinear projections of the copula-transformed data. Ding et al. [9], [10] define the copula correlation (Ccor) as half of the L1 distance between the copula density and the independence copula density. As for the MIC, it is defined by Reshef et al. [6] as the maximum, taken over all x-by-y grids G up to a given grid resolution (typically xy < n^0.6), of the empirical standardized mutual information, I(G)/log(min{x, y}), based on the empirical probability distribution over the boxes of the grid G.
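For univariate samples, the distance correlation admits a direct O(n²) computation: double-center the pairwise distance matrix of each sample, then correlate the two centered matrices. A sketch (ours, not the authors' code):

```python
import math

def dcor(xs, ys):
    # Distance correlation of two univariate samples (Székely et al.).
    n = len(xs)
    def centered(v):
        d = [[abs(v[i] - v[j]) for j in range(n)] for i in range(n)]
        rm = [sum(row) / n for row in d]                   # row means
        cm = [sum(d[i][j] for i in range(n)) / n for j in range(n)]
        gm = sum(rm) / n                                   # grand mean
        return [[d[i][j] - rm[i] - cm[j] + gm for j in range(n)]
                for i in range(n)]
    A, B = centered(xs), centered(ys)
    dcov2 = sum(A[i][j] * B[i][j] for i in range(n) for j in range(n)) / n**2
    dvarx = sum(a * a for row in A for a in row) / n**2
    dvary = sum(b * b for row in B for b in row) / n**2
    denom = math.sqrt(dvarx * dvary)
    return math.sqrt(dcov2 / denom) if denom > 0 else 0.0

# dCor equals 1 for an exact affine relationship.
assert abs(dcor([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]) - 1.0) < 1e-9
```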

1) Bias Analysis for Non-Functional Dependence
A bias analysis is performed for the MICe, Ccor, CoS, RDC, and dCor using three data samples drawn from a bivariate Gaussian copula with ρ = 0.2, 0.5 and 0.8, which model weak, medium and strong dependence, respectively. The sample sizes range from 50 to 2000 in steps of 50. From Fig. 5, it is observed that, unlike the MICe and Ccor, the CoS, RDC, and dCor are almost equal to ρ for large sample sizes.

2) Functional Dependence
Another series of simulations is conducted to compare the performance of the MICe, Ccor, CoS, RDC, and dCor when they are applied to four data sets drawn from affine, polynomial, periodic, and circular bivariate relationships with an increasing level of additive white Gaussian noise. Described in [23], the procedure is executed with N = n = 1000, where n is the number of realizations of a uniform random variable X and N is the number of times the procedure is executed.
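The data-generation step of [23] can be sketched as follows; the function menu and the noise scaling are our simplification, not the exact protocol:

```python
import math
import random

def noisy_dependence(kind, n, p, seed=0):
    # X uniform on [-1, 1]; Y = f(X) + p * N(0, 1) noise.
    rng = random.Random(seed)
    f = {"linear": lambda x: x,
         "quadratic": lambda x: x * x,
         "sinusoidal": lambda x: math.sin(2 * math.pi * x)}[kind]
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [f(x) + p * rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys

# With p = 0 the dependence is exactly functional.
xs, ys = noisy_dependence("quadratic", 100, p=0.0)
assert all(abs(y - x * x) < 1e-12 for x, y in zip(xs, ys))
```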
It is inferred from Table 5 that while the CoS, dCor, and Ccor steadily decrease as the noise level p increases, the MICe sharply decreases as p grows from 0.5 to 2 and then reaches a plateau for p > 2. The RDC also decreases steadily with increasing noise level for the functional dependencies considered, except for the quadratic dependence, where it maintains a high value even under heavy noise.

3) Ripley's Forms and Copula's Induced Dependence
Table 6 reports values of the MICe, Ccor, CoS, RDC, and dCor for Ripley's forms and copula-induced dependencies for a sample size n = 1000, averaged over 1000 Monte Carlo simulations. The values of Spearman's ρS for the Gumbel(5), Clayton(-0.88), Galambos(2), and BB6(2, 2) copulas are calculated using the copula and CDVine packages of the software environment R. As for the four Ripley's forms displayed in Fig. 6, a linear congruential generator using the Box-Muller transformation is used to generate several bivariate sequences with nonlinear dependencies.
Table 6 shows that the CoS, MICe, RDC, and Ccor correctly reveal some degree of nonlinear dependence for Ripley's form 2, with the Ccor detecting the highest level of dependence and the dCor the lowest level.It is observed that the Ccor is the only metric to correctly reveal some degree of nonlinear dependence for Ripley's form 3.
Furthermore, unlike the MICe values, the dCor and CoS values are very close to the Pearson's ρP value for the Gaussian copula and to the Spearman's ρS values for the Gumbel, Clayton, Galambos and BB6 copulas.

B. Statistical Power Analysis
Finally, following Simon and Tibshirani [23], the power of the statistical tests based on the CoS, dCor, RDC, TICe, and Ccor for bivariate independence, subject to increasing additive Gaussian noise levels, is evaluated. Six noisy functional dependencies with a noise level p ranging from 10% to 300% are considered: linear, quadratic, cubic, fourth-root, sinusoidal, and circular. The results are shown in Fig. 7.
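Power is estimated the usual Monte Carlo way: generate many dependent datasets, run the test on each, and record the rejection rate. A generic sketch (the statistic and generator arguments are placeholders, not the paper's code):

```python
def estimated_power(statistic, generate, n_trials, critical):
    # Fraction of dependent datasets on which the test rejects H0,
    # i.e. the statistic exceeds the critical value calibrated under H0.
    rejections = sum(1 for t in range(n_trials)
                     if statistic(*generate(t)) > critical)
    return rejections / n_trials

# Toy check: a statistic that always fires gives power 1.0.
assert estimated_power(lambda xs, ys: 1.0,
                       lambda t: ([1, 2, 3], [2, 4, 6]),
                       n_trials=10, critical=0.5) == 1.0
```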

C. Feature Selection Applied to Breast Cancer Data
In machine learning applications, feature selection is the all-important step of choosing an optimal subset of variables in order to reduce computation time, improve prediction performance, and discard irrelevant data.
Mutually dependent variables provide redundant information about the classes and thus act as noise for the predictor. The rule of thumb is that the best feature subset consists of mutually independent features, each of which is strongly dependent on the class label considered. Dimensionality reduction underlies the best-known families of methods in machine learning, namely filter, wrapper and embedded methods. The Pearson correlation coefficient and mutual information are widely used in feature selection; nevertheless, the results are often unsatisfactory. A serious alternative is to use the CoS index to work out the feature selection problem.
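The rule of thumb above suggests a simple greedy filter: keep a feature only if its pairwise dependence with every already-kept feature stays below the redundancy threshold. A sketch (our reading of the subset reduction; feature names and the dependence table are illustrative):

```python
def filter_redundant(features, dep, threshold):
    # Keep a feature iff its pairwise dependence with every
    # already-kept feature does not exceed the threshold.
    kept = []
    for f in features:
        if all(dep(f, g) <= threshold for g in kept):
            kept.append(f)
    return kept

# Toy dependence table: radius and perimeter are nearly functionally
# dependent, so one of them is dropped at threshold 0.90.
table = {frozenset({"radius", "perimeter"}): 0.97}
dep = lambda a, b: table.get(frozenset({a, b}), 0.10)
assert filter_redundant(["radius", "perimeter", "texture"], dep, 0.90) \
       == ["radius", "texture"]
```

Lowering the threshold makes the filter stricter, mirroring the 0.90 versus 0.85 subsets discussed below.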
A useful dataset for this purpose is the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, available in the UCI Machine Learning Repository. Breast tumor tissue is extracted using fine needle aspiration (FNA). A small drop of the fluid thus obtained is examined for the characteristics of individual cells and for important contextual features such as the radius of the nucleus, compactness, and smoothness, among others.
A dataset of 569 cells (malignant and benign) with 30 input features is obtained [24]. Among the 30 features, 20 are computed from the others; hence, only 10 features are retained as the initial subset. Table 6 reports the CoS measures for all pairwise feature dependencies. Using 0.90 as a threshold for declaring near-total dependence, the subset is reduced to 7 features. If the threshold is lowered to 0.85, the subset is further reduced to five features [25]. Fig. 8 displays scatter plots of the empirical copulas of the final subset, while Fig. 9 displays the corresponding heat maps.

VII. CONCLUSIONS AND FUTURE WORK
A new reliable statistic for multivariate nonlinear dependence has been proposed and its statistical properties unveiled. In particular, it asymptotically approaches zero for statistical independence and one for functional dependence. Finite-sample bias and standard deviation curves of the CoS have been estimated, and hypothesis testing rules have been developed to test bivariate independence. The power of the CoS-based test has been evaluated for noisy functional dependencies. Monte Carlo simulations show that the CoS performs reasonably well for both functional and non-functional dependence and exhibits good power for testing independence against all alternatives. The good performance of the CoS has also been demonstrated in the feature selection application. Note that the code that implements the CoS is available on GitHub.¹ As future research, the self-equitability of the CoS and other metrics will be assessed under various noise probability distributions, and the robustness of the CoS to outliers will be investigated. Furthermore, the CoS will be applied to common signal processing and machine learning problems, including data mining, cluster analysis, and testing of independence. Another interesting property of the CoS, not shared by the MICe, RDC, Ccor, and dCor, is its ability to measure multivariate dependence. This property will be investigated in future work.
Proof of Theorem 3: Suppose that Y = f(X) almost surely, where f(⋅) has a single optimum, which is necessarily a global one. Denote the non-increasing and the non-decreasing line segments of f(⋅), respectively. Note that f(⋅) may have inflection points but may not have a segment of constant value, because otherwise Y would be a mixed random variable, violating the continuity assumption. Let A denote a point with coordinates (x, y) on the graph of f(⋅). Consider the four quadrants D_1 = {X ≤ x, Y ≤ y}, D_2 = {X ≤ x, Y > y}, D_3 = {X > x, Y ≤ y} and D_4 = {X > x, Y > y}. Suppose that A is a point of the non-increasing segment. As shown in Fig. 10(a), the graph of f(⋅) avoids either D_1 or D_4, depending upon whether f(⋅) has a global minimum or a global maximum, respectively. In the former case, P(X ≤ x, Y ≤ y) = 0, implying that C(u,v) = 0, while in the latter case, P(X > x, Y > y) = 0, implying from (9) that C(u, v) = u + v - 1 ≥ 0. Combining both cases, it follows that for all (x, y) on the non-increasing segment, C(u,v) = max(u + v - 1, 0). Now, suppose that A is a point of the non-decreasing segment. As shown in Fig. 10(b), the graph of f(⋅) avoids either D_2 or D_3, depending upon whether f(⋅) has a global maximum or a global minimum, respectively. In the former case, P(X ≤ x, Y > y) = 0, implying from (7) that C(u,v) = u, while in the latter case, P(X > x, Y ≤ y) = 0, implying from (8) that C(u, v) = v. Combining both cases, it follows from (3) that for all (x, y) on the non-decreasing segment, C(u,v) = min(u, v).

Proof of Theorem 5:
Suppose that Y = f(X) almost surely, where f(.) has at least two global maxima and no local optima.As depicted in Fig. 11

Corollary 1 :
Let X and Y be two continuous random variables such that Y = f(X) almost surely. If f(⋅) is a periodic function, then (1) and (2) hold true at all the global maxima and global minima, respectively.

Fig. 1 .
Fig. 1. Graph (in blue dots) of the projections on the (u, C(u,v)) plane of the empirical copula C(u,v) associated with a pair of random variables (X, Y), where X ~ U(-1, 1) and Y = sin(2πX). The u coordinates of the data points are equally spaced over the unit interval. Similar graphs are shown for the M(u,v), W(u,v) and Π(u,v) copulas.

Theorem 5 :
If Y = f(X) almost surely, where f(⋅) has at least two global maxima or two global minima and no local optima on the domain D = Range(X) × Range(Y), then there exists a non-empty interval of X for which λ(C(u,v)) < 1. The proof of Theorem 5 is provided in the appendix. Theorem 6: If Y = f(X) almost surely, where f(⋅) has a local optimum, then λ(C(u,v)) ≤ 1 at that point.

Definition 6 and Corollary 3:
Let C_n(u,v) be the empirical copula given by (3). Let D be the set of m contiguous domains {D_i, i = 1, …, m}, where each D_i is a u-interval associated with a non-decreasing or non-increasing sequence of C_n(u_(i), v_j). Let C_i^min and C_i^max respectively denote the smallest and the largest value of C_n(u,v) on the domain D_i, and let λ_i be defined from them as in (16). Let n_i denote the number of data points in the i-th domain D_i, i = 1,…, m, while letting a boundary point belong to two contiguous domains, D_i and D_{i+1}. Then, the copula statistic is defined as in (17). Corollary 3: The CoS of two random variables, X and Y, has the following asymptotic properties:

Fig. 5 .
Fig. 5. Bias curves of the CoS, MICe, dCor, RDC, and Ccor for the bivariate Gaussian copula with ρ = 0.2, 0.5 and 0.8, displayed in a), b), and c), respectively, for sample sizes that vary from 50 to 2000 in steps of 50.

Fig. 8 .Fig. 9 .
Fig. 8. Scatter plots of the empirical copulas for a) the final feature subset excluding the perimeter and b) the final subset, where the CoS values are respectively 0.27 and 0.26.
(a), let B and C be two global maximum points of f(⋅) with coordinates (x_B, y_max) and (x_C, y_max), respectively. This means that f(x_B) = f(x_C) = y_max, and f(x) ≤ y_max for all x. Consider a point A with coordinates (x_A, y_A) such that x_B < x_A < x_C and y_A < y_max. Denote by g_1 and g_2 the line segments of f(⋅) defined over the intervals surrounding x_B and x_C, respectively, which are shown as solid lines in Fig. 11(a). Partition the domain into four quadrants, D_1 = {X ≤ x_A, Y ≤ y_A}, D_2 = {X ≤ x_A, Y > y_A}, D_3 = {X > x_A, Y ≤ y_A} and D_4 = {X > x_A, Y > y_A}. As observed in Fig. 11(a), the graph of f(⋅) intersects all four quadrants, yielding λ(C(u,v)) < 1. A similar proof can be developed for the case where f(⋅) has at least two global minima and no local optima. Proof of Theorem 6: Suppose that Y = f(X) almost surely, where f(⋅) has a local minimum point, say point A with coordinates (x_A, y_A), as shown in Fig. 11(b). This means that there exists a neighborhood of x_A over which f(x) ≥ y_A. As depicted in Fig. 11(b), let g_1 and g_2 denote the line segments of f(⋅) defined immediately to the left and to the right of x_A, respectively. Consider the four quadrants D_1, …, D_4 delineated by X = x_A and Y = y_A; as observed in Fig. 11(b), g_1 lies in D_2 and g_2 lies in D_4. Now, because A is by hypothesis a local minimum point, there exist further line segments of f(⋅) on which f(x) < y_A. Consequently, one of the following three cases arises: these segments meet only D_1, as depicted in Fig. 11(b); they meet only D_3; or they meet both D_1 and D_3. In the first case, λ(C(u,v)) < 1, while in the last two cases, λ(C(u,v)) = 1. A similar proof can be developed for f(⋅) with a local maximum point.

Fig. 10 .Fig. 11 .
Fig. 10. Graphs of a function Y = f(X) having a single optimum. A point A with coordinates (x, y) is located either on the non-increasing part, shown as a solid line in (a), or on the non-decreasing part, shown as a dashed line in (b), of the function f(⋅). Four quadrants, D_1, …, D_4, are delineated by the vertical and horizontal lines at positions X = x and Y = y, respectively.
7) Calculate the absolute differences between the three consecutive values of C_n(u_(i), v_j) centered at u_i^min (respectively at u_i^max) and decide that the central point is a local optimum if (i) both absolute differences are smaller than or equal to 1/n; and (ii) there are more than four points within the two adjacent domains, D_i and D_{i+1}; 8) Calculate λ_i given by (16); 9) Repeat Steps 2 through 7 for all the m domains, D_i, i = 1, …, m; 10) Calculate the CoS given by (17).

TABLE I .
SAMPLE MEANS AND SAMPLE STANDARD DEVIATIONS OF THE COS FOR THE GAUSSIAN, GUMBEL, AND CLAYTON COPULA IN THE INDEPENDENCE CASE

TABLE II .
TYPE-II ERRORS OF THE STATISTICAL TEST OF BIVARIATE INDEPENDENCE BASED ON THE COS FOR GAUSS(0) (COLUMNS: n, μ_n0, σ_n0, TYPE-II ERROR FOR ρ_n = 0.1, TYPE-II ERROR FOR ρ_n = 0.3)

TABLE IV .
SAMPLE MEANS AND SAMPLE STANDARD DEVIATIONS OF THE COS FOR THE NORMAL COPULA

TABLE V .
SAMPLE MEANS OF THE COS, DCOR AND THE MICE FOR SEVERAL DEPENDENCE TYPES AND ADDITIVE NOISE LEVELS

TABLE VI .
DEPENDENCE INDICES FOR COPULA DEPENDENCIES AND RIPLEY'S FORMS