University of Huddersfield Repository

Abstract: Computing statistical dependence of terms in textual documents is a widely studied subject and a core problem in many areas of science. This study focuses on this problem and explores estimation techniques based on the expected mutual information measure. A general framework is established for tackling a variety of estimations: (i) general forms of estimation functions are introduced; (ii) a set of constraints for the estimation functions is discussed; (iii) general forms of probability distributions are defined; (iv) general forms of the measures for calculating mutual information of terms (MIT) are formalised; (v) properties of the MIT measures are studied; and (vi) relations between the MIT measures are revealed. Four estimation methods are proposed as examples, and the mathematical meaning of each method is interpreted. The methods may be applied directly to practical problems for computing dependence values of individual term pairs. Due to its generality, our method is applicable to various areas involving statistical semantic analysis of textual data.


I. INTRODUCTION
Analysing and computing statistical dependence (relatedness, proximity, association, similarity) of terms (features, concepts, phrases, words) in textual documents is a widely studied subject in many areas of science. The subject has achieved importance and popularity during the past four decades or so, due chiefly to its demonstrated applications in numerous seemingly diverse areas of science. One of the commonly used tools of analysis and computation is the expected mutual information measure (EMIM) drawn from information theory [1], [2].
The issue of computing the mutual information of terms is an active research topic. A variety of methods have been developed in order to assign dependence values to individual term pairs, and then some decision is made on the basis of the values. Many studies have used the measure for a variety of tasks in, for instance, feature selection [3]- [6], document classification [7], face image clustering [8], multimodality image registration [9], information retrieval [10]- [14]. However, it seems that mutual information methods have not achieved their potential. The main problem we face in using EMIM is obtaining actual probability distributions, as the true distributions are invariably not known, and we have to estimate them from training data. This work explores techniques of estimation.
Before introducing a series of formulae, let us first clarify the difference between a term state value distribution and a term occurrence frequency distribution. A term is usually thought of as having states 'present' or 'absent' in a document. Thus, for an arbitrary term $t$, it will be convenient to introduce a variable $\delta$ taking values from the set $\Omega = \{1, 0\}$, where $\delta = 1$ expresses that $t$ is present and $\delta = 0$ expresses that $t$ is absent. Denote $t^{\delta} = t, \bar{t}$ when $\delta = 1, 0$, respectively. We call $\Omega$ a state value space, and each element in $\Omega$ a state value, of $t$. Similarly, for an arbitrary term pair $(t_i, t_j)$, we introduce a variable pair $(\delta_i, \delta_j)$ taking values from the set $\Omega \times \Omega = \{(1, 1), (1, 0), (0, 1), (0, 0)\}$. We call $\Omega \times \Omega$ a state value space, and each element in $\Omega \times \Omega$ a state value pair, of $(t_i, t_j)$.
Let $D = \{d_1, d_2, ..., d_m\}$ be a collection of documents (training data), and $V = \{t_1, t_2, ..., t_n\}$ a vocabulary of terms used to index individual documents in $D$. Denote by $V_d \subseteq V$ the set of terms occurring in document $d \in D$. Thus, for a given $d$, the term occurrence frequency distribution, generally denoted by $p_d(t) = p(t|d)$, is over $V$, whereas for a given term $t$ occurring in $d$, its state value distribution, denoted by $P_d(\delta) = P(t^{\delta}|d)$, is over $\Omega$. Obviously, each term $t \in V_d$ is matched to a state value distribution, and there are $|V_d|$ state value distributions in total for the document $d$.
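The difference between the two kinds of distribution can be sketched in a few lines. The sketch assumes, purely for illustration, that both are estimated by relative frequency, so that $p_d(t) = f_d(t)/\|d\|$ and $P_d(\delta = 1) = p_d(t)$; the function names and the toy document are hypothetical.

```python
from collections import Counter
from fractions import Fraction

def occurrence_distribution(doc_tokens):
    """p_d(t): a single distribution over the terms of the document."""
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    return {t: Fraction(c, total) for t, c in counts.items()}

def state_value_distributions(doc_tokens):
    """P_d(delta): one two-point distribution over {1, 0} for each term.

    Assumption: P_d(delta = 1) is taken to be the relative frequency.
    """
    p = occurrence_distribution(doc_tokens)
    return {t: {1: p_t, 0: 1 - p_t} for t, p_t in p.items()}

doc = ["t1", "t2", "t2", "t2", "t3", "t4"]  # toy document
p_d = occurrence_distribution(doc)          # sums to 1 over the vocabulary
P_d = state_value_distributions(doc)        # |V_d| distributions, each over {1, 0}
```

Here `p_d` is one distribution whose masses sum to 1 across terms, while `P_d` holds $|V_d| = 4$ separate distributions, each summing to 1 across the two state values.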
There exists statistical dependence between two terms, t i and t j , if the state value of one of them provides mutual information about the probability of the state value of the other [15]. The study [16] shows that there is a relationship between the frequencies (or probabilities) of terms and the mutual information of terms. Therefore, term t i taking some state value δ i (say δ i = 1) should be looked upon as complex because another state value (say δ i = 0) of t i , and state values of many other terms (i.e., all terms t j ∈ V − {t i }), may be dependent on this δ i [15].
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 4, No. 11, 2014

Mathematically, for two arbitrary distinct terms $t_i, t_j \in V$, the expected mutual information [1] about the probabilities of the state value pair $(\delta_i, \delta_j)$ of the term pair $(t_i, t_j)$ can be expressed by EMIM:

$$I(\delta_i; \delta_j) = \sum_{\delta_i, \delta_j = 1, 0} P(\delta_i, \delta_j) \log \frac{P(\delta_i, \delta_j)}{P(\delta_i) P(\delta_j)},$$

which measures the amount of information that $\delta_j$ provides about $\delta_i$, and vice versa. Intuitively, a high $I(\delta_i; \delta_j)$ value indicates that more of the information carried by one of the two terms $t_i$ and $t_j$ is determined by the other, and thus that the terms are more dependent; a low $I(\delta_i; \delta_j)$ value, on the other hand, suggests that $t_i$ and $t_j$ are better able to provide self-information and thus are likely to be independent. However, the current study does not support this intuition and instead points out: 1) one should consider the mutual information of $t_i$ and $t_j$ under the individual state value pairs $(\delta_i, \delta_j)$, where $\delta_i, \delta_j = 1, 0$; 2) one cannot assert from a high $I(\delta_i; \delta_j)$ value that $t_i$ and $t_j$ are highly dependent for their co-occurrence.

The estimation of the probability distributions $P(\delta)$ and $P(\delta_i, \delta_j)$ required in $I(\delta_i; \delta_j)$ is crucial and remains an open issue for effectively distinguishing potentially dependent term pairs from many others; it is therefore the main concern of the current study. We attempt to establish a general framework for constructing estimation functions, with a set of constraints, in order to define $P(\delta)$ and $P(\delta_i, \delta_j)$ meeting some criteria. We next formalise measures for computing the mutual information of terms (MIT) under the individual state values and study the corresponding properties of the MIT measures, which form an underlying basis for practical applications. We then propose four estimation methods, as examples, to clarify and illustrate the ideas described in the current study by interpreting their mathematical meanings and discussing their properties.
The four estimation methods may be applied directly to practical problems for assigning a dependence value to each term pair.
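As a minimal sketch of the EMIM computation, the function below evaluates $I(\delta_i; \delta_j)$ for a $2 \times 2$ joint distribution. The natural logarithm, the convention $0 \log 0 = 0$, the function name `emim` and the three toy distributions are assumptions made for illustration; the marginals are derived from the joint itself.

```python
import math

def emim(joint):
    """Expected mutual information of a 2x2 joint distribution.

    `joint` maps state value pairs (di, dj) in {1, 0} x {1, 0} to
    probabilities. Cells with zero probability contribute nothing.
    """
    states = (1, 0)
    p_i = {a: sum(joint[(a, b)] for b in states) for a in states}
    p_j = {b: sum(joint[(a, b)] for a in states) for b in states}
    total = 0.0
    for a in states:
        for b in states:
            p = joint[(a, b)]
            if p > 0:
                total += p * math.log(p / (p_i[a] * p_j[b]))
    return total

# Independent terms: every cell equals the product of its marginals.
independent = {(1, 1): 0.25, (1, 0): 0.25, (0, 1): 0.25, (0, 0): 0.25}
# Terms that are always present together or absent together.
coupled = {(1, 1): 0.5, (1, 0): 0.0, (0, 1): 0.0, (0, 0): 0.5}
# Terms that never co-occur: exactly one of them is present.
exclusive = {(1, 1): 0.0, (1, 0): 0.5, (0, 1): 0.5, (0, 0): 0.0}
```

The `coupled` and `exclusive` distributions yield the same EMIM value, although only the first reflects co-occurrence. This is the situation behind point 2) above: a high $I(\delta_i; \delta_j)$ alone does not say which state value pairs produced it.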
The remainder of the paper is organized as follows. Section II establishes a general framework for constructing estimation functions and defining probability distributions. Section III formalises the MIT measures and studies their properties. Section IV proposes four estimation methods and discusses corresponding properties. Section V addresses some key points of our study. Conclusions are drawn in Section VI.

II. A GENERAL ESTIMATION FRAMEWORK
In practical applications, the probability distributions of state values may be estimated from training data. This section establishes a general framework in order to define two arguments, P (δ) and P (δ i , δ j ), required in I(δ i ; δ j ). The definition of the joint state value distribution, P (δ i , δ j ), is a more complicated task and the main concern of this section.
In the current study, the probability distributions are defined from estimation functions and, therefore, we need to first introduce the concept of estimation functions. Let $\Xi \subseteq D$ be the set of sample documents considered, and $V_\Xi \subseteq V$ the set of terms occurring in at least one of the documents in $\Xi$. We have the following definition.

Definition 2.1 For arbitrary terms $t, t_i, t_j \in V$, where $i \neq j$, we define two non-negative functions, denoted by $\psi_\Xi(t)$ and $\gamma_\Xi(t_i, t_j)$, with the form (1), satisfying a set of constraints (2), and call $\psi_\Xi(t)$ and $\gamma_\Xi(t_i, t_j)$ the general forms of estimation functions.

Definition 2.2 For arbitrary given terms $t, t_i, t_j \in V_\Xi$, where $i \neq j$, suppose $\psi_\Xi(t)$ and $\gamma_\Xi(t_i, t_j)$ are the estimation functions given in Definition 2.1. We define $P_\Xi(\delta)$ by expression (3) and $P_\Xi(\delta_i, \delta_j)$ by expression (4), and call $P_\Xi(\delta)$ and $P_\Xi(\delta_i, \delta_j)$ the general forms of probability distributions of state values of the term pair $(t_i, t_j)$.

Let us next examine the absolute continuity of $P_\Xi(\delta_i, \delta_j)$ with respect to $P_\Xi(\delta_i) P_\Xi(\delta_j)$, or in symbols, $P_\Xi(\delta_i, \delta_j) \ll P_\Xi(\delta_i) P_\Xi(\delta_j)$. The following theorem serves this purpose.

Theorem 2.2 Suppose $P_\Xi(\delta)$ and $P_\Xi(\delta_i, \delta_j)$ are given in Definition 2.2. Then, $P_\Xi(\delta_i, \delta_j) \ll P_\Xi(\delta_i) P_\Xi(\delta_j)$ for $\delta_i, \delta_j = 1, 0$.

Proof: The proof is trivial. It can be easily seen, by expressions (1) and (3), that $0 < P_\Xi(\delta_i)$ and, likewise, $0 < P_\Xi(\delta_j)$ always hold, so the product $P_\Xi(\delta_i) P_\Xi(\delta_j)$ never vanishes and the condition for absolute continuity is satisfied.

It should be emphasized that, in order to speak of the mutual information of terms, we must verify that the two arguments of $I(\delta_i; \delta_j)$ meet the following three criteria simultaneously:
1) $P_\Xi(\delta)$ and $P_\Xi(\delta_i, \delta_j)$ are probability distributions,
2) $P_\Xi(\delta_i)$ and $P_\Xi(\delta_j)$ are the marginal distributions of $P_\Xi(\delta_i, \delta_j)$,
3) $P_\Xi(\delta_i, \delta_j)$ is absolutely continuous with respect to $P_\Xi(\delta_i) P_\Xi(\delta_j)$.
Meeting these three criteria is the major premise when applying $I(\delta_i; \delta_j)$ to effectively capture the mutual information inherent among terms. We will give an example to clarify this idea in Section V.

We thus learn from Theorems 2.1 and 2.2, under the general framework, that as long as $P_\Xi(\delta)$ and $P_\Xi(\delta_i, \delta_j)$ are defined from estimation functions satisfying the constraints given in (2), they are probability distributions meeting the three criteria. Consequently, the difficulty becomes:
• to construct $\psi_\Xi(t)$ and $\gamma_\Xi(t_i, t_j)$ so that they capture the occurrence and co-occurrence information of terms in a way that is practically appropriate and mathematically meaningful in application contexts;
• to verify the constraints given in (2) for each term pair considered, in order to ensure that the probability distributions defined from $\psi_\Xi(t)$ and $\gamma_\Xi(t_i, t_j)$ meet the three criteria.
Thus, the construction of $\psi_\Xi(t)$ and $\gamma_\Xi(t_i, t_j)$ and the verification of the constraints given in (2), which are relatively simple tasks, are the core of obtaining actual probability distributions $P_\Xi(\delta)$ and $P_\Xi(\delta_i, \delta_j)$. Section IV will return to this issue and provide four useful examples, after the MIT measures have been formalised and their properties and relations discussed in the next section.
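The verification step can be sketched as follows. The sketch assumes one particular form for the two distributions: $P(\delta = 1) = \psi$, and a joint table whose $(1, 1)$ cell is $\gamma$ with the remaining cells fixed by the requirement that the marginals equal $\psi_i$ and $\psi_j$. This form is an assumption chosen to be consistent with criterion 2), not a transcription of expressions (3) and (4), and the function names are hypothetical. Exact rational arithmetic (`fractions.Fraction`) is used so that the equality checks are meaningful.

```python
from fractions import Fraction

def state_distribution(psi):
    """Assumed form of P(delta): P(1) = psi, P(0) = 1 - psi."""
    return {1: psi, 0: 1 - psi}

def joint_distribution(psi_i, psi_j, gamma):
    """Assumed form of P(delta_i, delta_j): the only 2x2 table whose
    (1, 1) cell is gamma and whose marginals are psi_i and psi_j."""
    return {
        (1, 1): gamma,
        (1, 0): psi_i - gamma,
        (0, 1): psi_j - gamma,
        (0, 0): 1 - psi_i - psi_j + gamma,
    }

def meets_criteria(psi_i, psi_j, gamma):
    """Check the three criteria for one term pair."""
    joint = joint_distribution(psi_i, psi_j, gamma)
    p_i, p_j = state_distribution(psi_i), state_distribution(psi_j)
    # Criterion 1: non-negative cells that sum to one.
    is_distribution = (all(v >= 0 for v in joint.values())
                       and sum(joint.values()) == 1)
    # Criterion 2: row and column sums reproduce the single-term distributions.
    marginals_ok = all(
        joint[(a, 1)] + joint[(a, 0)] == p_i[a]
        and joint[(1, a)] + joint[(0, a)] == p_j[a]
        for a in (1, 0)
    )
    # Criterion 3: the joint vanishes wherever the product of marginals does.
    abs_continuous = all(
        joint[(a, b)] == 0
        for a in (1, 0) for b in (1, 0)
        if p_i[a] * p_j[b] == 0
    )
    return is_distribution and marginals_ok and abs_continuous

ok = meets_criteria(Fraction(1, 2), Fraction(1, 3), Fraction(1, 4))
bad = meets_criteria(Fraction(1, 6), Fraction(1, 2), Fraction(1, 4))
```

Under this assumed form, the cells always sum to one and the marginals always match, so the only way a term pair can fail is through a negative cell, for example when $\gamma$ exceeds $\psi_i$ as in the second call.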
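III. THE MIT MEASURES
Suppose we are given two arbitrary distinct terms $t_i, t_j \in V_\Xi$. In order to measure the mutual information of terms $t_i$ and $t_j$, we need to consider the mutual information under each state value pair $(\delta_i, \delta_j)$; namely, we need to measure the extent of the contribution made by the individual state values to EMIM:

$$I_\Xi(\delta_i; \delta_j) = \sum_{\delta_i, \delta_j = 1, 0} P_\Xi(\delta_i, \delta_j) \log \frac{P_\Xi(\delta_i, \delta_j)}{P_\Xi(\delta_i) P_\Xi(\delta_j)}.$$

Note that the above expression can be expressed as a sum of four items. Each of the four items,

$$P_\Xi(\delta_i, \delta_j) \log \frac{P_\Xi(\delta_i, \delta_j)}{P_\Xi(\delta_i) P_\Xi(\delta_j)},$$

can be regarded as the 'mutual information of terms' $t_i$ and $t_j$, in support of dependence but rejecting independence, under the state value pair $(\delta_i, \delta_j)$, where $\delta_i, \delta_j = 1, 0$. Thus, we can regard each item as a MIT measure, computing the extent of the contribution made by the corresponding state value pair to $I_\Xi(\delta_i; \delta_j)$. Now, substituting estimates (3) and (4) into (6), corresponding to the respective four state value pairs $(1, 1)$, $(1, 0)$, $(0, 1)$, $(0, 0)$, we can formalise the general forms of the four MIT measures by the definition below.

Definition 3.1 Suppose $P_\Xi(\delta)$ and $P_\Xi(\delta_i, \delta_j)$ are the probability distributions given in Definition 2.2. Then the general forms of the four MIT measures can be defined as follows:

$$\mathrm{mit}_\Xi(t_i, t_j) = P_\Xi(1, 1) \log \frac{P_\Xi(1, 1)}{P_\Xi(\delta_i = 1) P_\Xi(\delta_j = 1)},$$

which computes the dependence of terms $t_i$ and $t_j$ for their co-occurrence in $\Xi$;

$$\mathrm{mit}_\Xi(t_i, \bar{t}_j) = P_\Xi(1, 0) \log \frac{P_\Xi(1, 0)}{P_\Xi(\delta_i = 1) P_\Xi(\delta_j = 0)},$$

which computes the dependence of term $t_i$ occurring but term $t_j$ not occurring in $\Xi$;

$$\mathrm{mit}_\Xi(\bar{t}_i, t_j) = P_\Xi(0, 1) \log \frac{P_\Xi(0, 1)}{P_\Xi(\delta_i = 0) P_\Xi(\delta_j = 1)},$$

which computes the dependence of term $t_i$ not occurring but term $t_j$ occurring in $\Xi$;

$$\mathrm{mit}_\Xi(\bar{t}_i, \bar{t}_j) = P_\Xi(0, 0) \log \frac{P_\Xi(0, 0)}{P_\Xi(\delta_i = 0) P_\Xi(\delta_j = 0)},$$

which computes the dependence of both terms $t_i$ and $t_j$ not occurring in $\Xi$.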
Clearly, each of the four MIT measures is uniquely determined by the estimation functions ψ Ξ (t) and γ Ξ (t i , t j ).
Next, we give some interesting properties of the four MIT measures by Theorem 3.1 below. The properties derive their importance from the fact that they underpin the methods proposed in the current study and are essential for guiding practical applications.  Proof: The proof of (b) is obvious. Here we prove only (a), and a similar proof can be given for (c). Consider the general forms of the four MIT measures.
which correspond respectively to Hence, the four inequalities in (a) hold.
The properties given in Theorem 3.1 enable us to gain an insight into the signs of the four MIT measures. That is, the relation between $\gamma_\Xi(t_i, t_j)$ and $\psi_\Xi(t_i) \psi_\Xi(t_j)$ determines all the signs of $\mathrm{mit}_\Xi(t_i^{\delta_i}, t_j^{\delta_j})$ for $\delta_i, \delta_j = 1, 0$. Thus, with the properties given in Theorem 3.1, we can further learn the relations of the four MIT measures from the signs:
• The signs of $\mathrm{mit}_\Xi(t_i, t_j)$ and $\mathrm{mit}_\Xi(\bar{t}_i, \bar{t}_j)$ are always the same, and so are the signs of $\mathrm{mit}_\Xi(t_i, \bar{t}_j)$ and $\mathrm{mit}_\Xi(\bar{t}_i, t_j)$;
• The signs of $\mathrm{mit}_\Xi(t_i, t_j)$ and $\mathrm{mit}_\Xi(\bar{t}_i, \bar{t}_j)$ are always opposite to the signs of $\mathrm{mit}_\Xi(t_i, \bar{t}_j)$ and $\mathrm{mit}_\Xi(\bar{t}_i, t_j)$.
These relations tell us a key point of applying $I_\Xi(\delta_i; \delta_j)$, which we will explain in Section V.
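The sign relations can be checked numerically with a short sketch. It assumes the joint table has $\gamma$ in the $(1, 1)$ cell with marginals $\psi_i$ and $\psi_j$ (an assumed form, not a transcription of expression (4)), uses the natural logarithm, and treats a zero cell as contributing nothing; the function name and the numeric inputs are hypothetical.

```python
import math

def mit_measures(psi_i, psi_j, gamma):
    """Four per-state contributions to EMIM under the assumed joint form.

    Returns a dict keyed by the state value pair (delta_i, delta_j).
    """
    joint = {
        (1, 1): gamma,
        (1, 0): psi_i - gamma,
        (0, 1): psi_j - gamma,
        (0, 0): 1 - psi_i - psi_j + gamma,
    }
    p_i = {1: psi_i, 0: 1 - psi_i}
    p_j = {1: psi_j, 0: 1 - psi_j}
    out = {}
    for (a, b), p in joint.items():
        # A zero cell contributes nothing (0 log 0 = 0 by convention).
        out[(a, b)] = p * math.log(p / (p_i[a] * p_j[b])) if p > 0 else 0.0
    return out

positive = mit_measures(0.5, 0.4, 0.3)  # gamma > psi_i * psi_j
negative = mit_measures(0.5, 0.4, 0.1)  # gamma < psi_i * psi_j
```

In `positive`, the $(1, 1)$ and $(0, 0)$ items are positive and the $(1, 0)$ and $(0, 1)$ items are negative; in `negative`, all four signs flip. In both cases the four items sum to a positive EMIM value.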

IV. EXAMPLE ESTIMATIONS
As mentioned previously, the construction of the estimation functions and verification of the constraints are the core of defining actual probability distributions. This section presents four estimation methods, as examples, to illustrate our ideas described in the previous section. The first three consider the estimates in individual documents (i.e., |Ξ| = 1), and the last one considers the estimate in the set of documents (i.e., |Ξ| > 1).
In what follows, we always assume that 2 < |V d | ≤ n (where n = |V |), namely, each document d ∈ D has at least three distinct terms. Also, for an arbitrary term t ∈ V , we denote

A. Estimate in a Single Document
Suppose each document $d$ is represented by a $1 \times n$ frequency matrix $m_d = [f_d(t)]_{1 \times n}$. Then, for an arbitrary term $t \in V$, introduce the estimation function $\psi_d(t)$ given in expression (7). Clearly, we have $0 < \psi_d(t) < 1$ for every $t \in V_d \subseteq V$. Next, for an arbitrary given term $t \in V_d$, define the probability distribution $P_d(\delta)$ by expression (3), which gives expression (8). The function $\psi_d(t)$ and the distribution $P_d(\delta)$ will be used in the three methods below.

A.1 Method One
For two arbitrary distinct terms $t_i, t_j \in V$, introduce the estimation function $\gamma_d(t_i, t_j)$ given in expression (9), where the denominator of $\gamma_d$ is the sum of all the possible products. Next, for two arbitrary given terms $t_i, t_j \in V_d$ (where $i \neq j$), define the probability distribution $P_d(\delta_i, \delta_j)$ by expression (4), which gives expression (10). The constraints given in (2) then have to be verified. Study [15] has proven the required relation for the functions $\psi_d(t)$ and $\gamma_d(t_i, t_j)$ given in (7) and (9), respectively. Thus we can write immediately the following theorem [15].
The above theorem tells us, when the estimation functions given in (7) and (9) are used, that P d (δ i , δ j ) given in (10) is a probability distribution if two conditions t j ≥ f 2 d (t j ) and t i ≥ f 2 d (t i ) are satisfied simultaneously. The conditions can also be verified by p d (t i ) ≥ γ d (t i , t j ) and p d (t j ) ≥ γ d (t i , t j ), respectively, which may be easier to compute in practical application. Next, we give the property of the MIT measures by the following corollary.  (8) and (10), four inequalities, . Proof: By Theorem 4.1, P d (δ i , δ j ) given in (10) is a probability distribution for terms t i , t j ∈ V d . Also, Thus, from (a) of Theorem 3.1, four inequalities hold.

A.2 Method Two
Note that $f_d(t)$ is the number of times that term $t$ occurs in $d$ and that $f_d(t_1) + f_d(t_2) + ... + f_d(t_n) = \|d\|$. From these quantities one can write down the probability that two distinct terms $t_i$ and $t_j$ are simultaneously found in $d$. Hence, for two arbitrary distinct terms $t_i, t_j \in V$, this probability is taken as the estimation function $\gamma_d(t_i, t_j)$ given in expression (11). Next, for two arbitrary given terms $t_i, t_j \in V_d$ (where $i \neq j$), define the probability distribution $P_d(\delta_i, \delta_j)$ by expression (4), which gives expression (12). The following theorem gives two conditions under which $P_d(\delta_i, \delta_j)$ satisfies the constraints given in (2).
A similar proof can be applied to p d (t j ) ≥ γ d (t i , t j ).

A.3 Method Three
From the same quantities one can write down the probability that term $t_j$ is found in $d$ after term $t_i$ has been found in $d$, where $i \neq j$. Thus, for two arbitrary distinct terms $t_i, t_j \in V$, this probability is taken as the estimation function $\gamma_d(t_i, t_j)$ given in expression (13). Next, for two arbitrary given terms $t_i, t_j \in V_d$ (where $i \neq j$), define the probability distribution $P_d(\delta_i, \delta_j)$ by expression (4), which gives expression (14). The following theorem establishes whether any verification condition is needed for $P_d(\delta_i, \delta_j)$ to satisfy the constraints given in (2); it states that $P_d(\delta_i, \delta_j)$ in (14) is a probability distribution.
A similar proof can be applied to p d (t j ) > γ d (t i , t j ).
It is clear, unlike Methods 1 and 2, that $P_d(\delta_i, \delta_j)$ in (14) is a probability distribution unconditionally. Next, we give the property of the MIT measures by the following corollary.

Corollary 4.3 For the four MIT measures derived from expressions (8) and (14), the four inequalities given in (a) of Theorem 3.1 always hold for arbitrary terms $t_i, t_j \in V_d$.

Proof: By Theorem 4.3, $P_d(\delta_i, \delta_j)$ given in (14) is a probability distribution for terms $t_i, t_j \in V_d$. Also, the required relation between $\gamma_d(t_i, t_j)$ and $\psi_d(t_i) \psi_d(t_j)$ follows from $\|d\| - f_d(t_i) < \|d\|$. Hence, from (a) of Theorem 3.1, the four inequalities hold.

B. Estimate in a Set of Documents
The above three estimation methods consistently use frequency representation for the individual documents. However, in some probabilistic methods, one would state that the binary assumption suffices to specify the dependence of terms. The method discussed here is under this assumption.
By 'binary' it is here meant that each document $d \in D$ is represented by a $1 \times n$ matrix in which each element is a binary number, equal to 1 when $t \in V_d$ and equal to 0 when $t \in V - V_d$. Consider a sample set $\Xi$ satisfying $|\Xi| > 1$. Denote by $n_\Xi(t)$ the number of documents in $\Xi$ in which term $t$ occurs, and by $n_\Xi(t_i, t_j)$ the number of documents in $\Xi$ in which terms $t_i$ and $t_j$ co-occur. It can be easily seen that $n_\Xi(t_i, t_j) \leq n_\Xi(t_i), n_\Xi(t_j) \leq |\Xi|$. Then, for an arbitrary term $t \in V$, introduce the estimation function $\psi_\Xi(t)$ given in expression (15). Obviously, we have $0 < \psi_\Xi(t) < 1$ for every $t \in V_\Xi \subseteq V$. Next, for an arbitrary given term $t \in V_\Xi$, define the probability distribution $P_\Xi(\delta)$ by expression (3), which gives expression (16). The function $\psi_\Xi(t)$ and the distribution $P_\Xi(\delta)$ will be used in the fourth method below.

B.1 Method Four
For two arbitrary distinct terms $t_i, t_j \in V$, introduce the estimation function $\gamma_\Xi(t_i, t_j)$ given in expression (17). Next, for two arbitrary given terms $t_i, t_j \in V_\Xi$ (where $i \neq j$), define the probability distribution $P_\Xi(\delta_i, \delta_j)$ by expression (4), which gives expression (18). The constraints can be checked for arbitrary $t_i, t_j \in V_\Xi$ using $n_\Xi(t_i, t_j) \leq n_\Xi(t_i)$ and $n_\Xi(t_i, t_j) \leq n_\Xi(t_j)$. Hence the estimation functions given in (15) and (17) satisfy the constraints given in (2), and thus we can give the following theorem.

Method 4 is the most commonly used of the four in many areas, such as information retrieval, natural language processing, document classification and sentiment analysis. More discussion on this method, including its properties and potential application problems, can be found in [17].

V. DISCUSSION
Some key points that help in understanding the methods proposed under the general framework are addressed in this section. These key points are also important for guiding practical applications.

First, it should be possible, though it may not be easy, to construct a variety of estimation functions and then to define probability distributions and verify the corresponding constraints for formalising the MIT measures. For suitable choices of estimation functions that are practically appropriate for, and mathematically meaningful to, a specific application problem, the term state distributions, when substituted into the measures $\mathrm{mit}_\Xi(t_i^{\delta_i}, t_j^{\delta_j})$ ($\delta_i, \delta_j = 0, 1$) and/or $I(\delta_i; \delta_j)$, can be expected to capture the mutual information of terms. The information may be used to develop a variety of techniques for assigning dependence values to individual term pairs, and some decision is then made on the basis of the values. A summary of the four example estimation methods proposed in this study is given in Table I. It is important to understand that the MIT measures formalised by different estimation methods may have entirely different properties. For instance, let us return to the four example estimations discussed in Section IV and consider an inequality, Then some key points regarding the properties and relationships of the MIT measures of the four corresponding Methods 1-4 can be made below.
• Theorems/Corollaries 4.1-4.3 in respective Methods 1-3 tell us, when estimation functions (7), (9), (11) and (13) are used, that the above inequality always holds, and that terms co-occurring in document d must be more or less statistically dependent since it is always mit d (t i , t j ) > 0 supporting a dependence assertion. • Theorem/Corollary 4.4 in Method 4 tells us, when estimation functions (15) and (17) are used, that the above inequality does not always hold, and that terms may or may not be statistically dependent for their co-occurrence since the sign of mit Ξ (t i , t j ) might be different from term pair to term pair. Therefore, we can learn from the Theorems/Corollaries: for two terms making the above inequality hold, some estimation functions ensure them to be more or less dependent for their co-occurrence, whereas other estimation functions cannot guarantee them to be dependent for their co-occurrence. This also clearly indicates, for the same term pairs, that different estimation methods may result in entirely different conclusions regarding the statistical dependence for their co-occurrence.
Second, as is well known, the MIT measures may influence experimental performance significantly. However, as the probability distributions are normally obtained according to the practical application, it seems that only the "form" of the mutual information measure has been the main concern of research in the literature, whereas the verification of the probability distributions is often ignored as an unimportant matter. This implicitly means that any function $i(x_1, x_2)$ with the right form would be a "mutual information measure" of $x_1$ and $x_2$ for their co-occurrence, and that the discussion of the three criteria for $P(x)$ and $P(x_1, x_2)$ in the function is trivial. This is not true. The function $i(x_1, x_2)$ is not necessarily a mutual information measure. In fact, $i(x_1, x_2)$ is not a mutual information measure in the information-theoretic sense if $P(x)$ and $P(x_1, x_2)$ are not probability distributions and/or if $P(x_1)$ and $P(x_2)$ are not marginal distributions of the joint distribution $P(x_1, x_2)$ (even though they may all be probability distributions). It may not even converge if $P(x_1, x_2) \ll P(x_1)$ and $P(x_1, x_2) \ll P(x_2)$ do not hold. Therefore, in practical applications, it makes no sense to use a function that merely looks like a mutual information measure to compute the mutual information of terms when any one of the three criteria is not satisfied. We emphasize that verifying that $P(\delta)$ and $P(\delta_i, \delta_j)$ meet the three criteria is the major premise when applying $I(\delta_i; \delta_j)$ to effectively capture the mutual information inherent among terms.

A simple but interesting example given in our related study [15] may clarify this idea. We give a brief explanation here; details of the computation can be found in [15]. Suppose we are given a document $d = \{t_1, t_2, t_2, t_2, t_3, t_4\} \in D$. This example considers the estimation functions given in Method 1 and illustrates a specific instance in which they fail for two terms $t_1, t_2 \in V_d$. With expressions (7), (8), (9) and (10), we have $\gamma_d(t_1, t_2) = \frac{1}{4}$. It can be easily seen, for the term pair $(t_1, t_2)$, that the corresponding $P_d(\delta_1, \delta_2)$ is not a probability distribution since the constraints given in (2) are not satisfied (i.e., $\psi_d(t_1) < \gamma_d(t_1, t_2)$). Consequently, $P_d(\delta_1 = 1, \delta_2 = 1)$ is not reliable for measuring the dependence of $t_1$ and $t_2$ for their co-occurrence.
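The failure in this example can be reproduced with a short sketch. Expressions (7) and (9) are not reproduced in this text, so the sketch assumes forms that match the description of Method 1 and the stated value $\gamma_d(t_1, t_2) = \frac{1}{4}$: $\psi_d(t) = f_d(t)/\|d\|$, and $\gamma_d(t_i, t_j)$ equal to $f_d(t_i) f_d(t_j)$ divided by the sum of frequency products over all distinct term pairs of $d$. The function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

def psi(freq, t):
    """Assumed single-term estimate: relative frequency f_d(t) / ||d||."""
    return Fraction(freq[t], sum(freq.values()))

def gamma_method1(freq, ti, tj):
    """Assumed Method 1 estimate: f_d(ti) * f_d(tj) divided by the sum of
    frequency products over all distinct term pairs in the document."""
    denom = sum(freq[a] * freq[b] for a, b in combinations(freq, 2))
    return Fraction(freq[ti] * freq[tj], denom)

freq = Counter(["t1", "t2", "t2", "t2", "t3", "t4"])  # the document d above
g12 = gamma_method1(freq, "t1", "t2")  # 3 / 12 = 1/4, as stated in the text
p1 = psi(freq, "t1")                   # 1/6 under the assumed form of psi_d
constraint_ok = p1 >= g12              # False: psi_d(t1) < gamma_d(t1, t2)
```

Under these assumed forms the cell for $(\delta_1, \delta_2) = (1, 0)$ would be $\psi_d(t_1) - \gamma_d(t_1, t_2) = -\frac{1}{12}$, which is why the pair $(t_1, t_2)$ does not yield a probability distribution.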
The key points regarding the probability distributions are:
• There may be many term pairs for which the corresponding $P_d(\delta_i, \delta_j)$ is indeed a probability distribution. However, it is possible that not all term pairs have a corresponding probability distribution.
• In order to compute the MIT of terms, we must verify the constraints given in (2); that is, we have to check that both constraints are satisfied simultaneously, for each of the term pairs considered. Thus, those term pairs (rather than two individual terms) for which the corresponding $P_d(\delta_i, \delta_j)$ does not satisfy the constraints should be discarded immediately and omitted from the computation of MIT.
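The discarding step can be sketched as a filter over term pairs. The sketch assumes that the constraints to check are $\gamma(t_i, t_j) \leq \psi(t_i)$ and $\gamma(t_i, t_j) \leq \psi(t_j)$, the two inequalities named in the text; the function name and the numeric estimates are hypothetical illustration values supplied by the caller.

```python
from fractions import Fraction
from itertools import combinations

def admissible_pairs(terms, psi, gamma):
    """Split term pairs into those that satisfy both constraints and those
    that do not.

    `psi` maps a term to its estimate; `gamma` maps a sorted term pair
    (ti, tj) to its joint estimate.
    """
    kept, discarded = [], []
    for ti, tj in combinations(sorted(terms), 2):
        g = gamma[(ti, tj)]
        if g <= psi[ti] and g <= psi[tj]:
            kept.append((ti, tj))
        else:
            discarded.append((ti, tj))
    return kept, discarded

# Hypothetical estimates for a four-term document.
psi = {"t1": Fraction(1, 6), "t2": Fraction(1, 2),
       "t3": Fraction(1, 6), "t4": Fraction(1, 6)}
gamma = {("t1", "t2"): Fraction(1, 4), ("t1", "t3"): Fraction(1, 12),
         ("t1", "t4"): Fraction(1, 12), ("t2", "t3"): Fraction(1, 4),
         ("t2", "t4"): Fraction(1, 4), ("t3", "t4"): Fraction(1, 12)}
kept, discarded = admissible_pairs(psi.keys(), psi, gamma)
```

Note that the decision is made per pair: term `t1` is dropped from the pair with `t2` but remains usable in its pairs with `t3` and `t4`.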
Third, the estimation functions given in Methods 1-3 can be applied to document representations not only of the form $m_d = [f_d(t)]_{1 \times n}$, but also in a more general case, where each document $d$ is represented by a $1 \times n$ (weight) matrix in which each element $w_d(t)$ is a real number satisfying $w_d(t) > 0$ when $t \in V_d$ and $w_d(t) = 0$ when $t \in V - V_d$. The function $w_d(t)$ is called a weighting function, and it indicates the importance of term $t$ in representing document $d$. For instance, the weighting function in Methods 1-3 is $w_d(t) = f_d(t)$. The key points regarding the estimation functions are given below.
• Methods 1-3 should be applicable to any quantitative document representation.
• $\psi_d(t)$ and $\gamma_d(t_i, t_j)$ should be used to capture the information of occurrence and co-occurrence of terms.
• $w_d(t)$ should be the main component of the estimation functions; it is construed by means of the occurrence frequencies and co-occurrence frequencies of terms.
The extension of, for instance, Method 1 can be found in another of our studies [15]. It is beyond the scope of the current paper to discuss the issue of document representation in greater detail; some formal discussion and technical treatment can be found in, for instance, studies [18]-[20].
Fourth, it is certainly true that the MIT measures given in Definition 3.1 can be used to measure the extent of dependence of terms $t_i$ and $t_j$. It is also certainly true that the larger the quantities the measures offer, the more strongly term $t_i$ is statistically dependent on term $t_j$ (and vice versa). However, the implications of the dependence obtained from the individual MIT measures are different. Remember that we always emphasize 'the dependence under the state value $(\delta_i, \delta_j)$'. This emphasis is necessary because it clearly indicates that it is the state value $(\delta_i, \delta_j)$ that supports the dependence. For instance, terms $t_i$ and $t_j$ may depend highly on one another when $t_i$ occurs but $t_j$ does not occur in some document; in this case, we are talking about the dependence under the state value $(\delta_i, \delta_j) = (1, 0)$. In a practical application, what we generally concentrate on is the statistics of co-occurrence of terms. That is, the dependence with which we are really concerned is that under the state value $(\delta_i, \delta_j) = (1, 1)$ of the term pair $(t_i, t_j)$. In this case, we need only apply the first item of $I(\delta_i; \delta_j)$ and verify the constraints given in (2). For instance, for Method 1, we need only use the measure $\mathrm{mit}_d(t_i, t_j)$ and verify the corresponding condition to ensure that $t_i$ and $t_j$ are highly dependent under their co-occurrence.

Fifth, from a high expected mutual information value, we cannot state immediately that the state value $(\delta_i, \delta_j) = (1, 1)$ makes a larger contribution to $I_\Xi(\delta_i; \delta_j)$, and thus we cannot assert that terms $t_i$ and $t_j$ are highly dependent for their co-occurrence in $\Xi$. This is because, with the relations of the MIT measures learned from their signs, when $\gamma_\Xi(t_i, t_j) < \psi_\Xi(t_i) \psi_\Xi(t_j)$, the positive value $I_\Xi(\delta_i; \delta_j)$ will be dominated by the positive quantities $\mathrm{mit}_\Xi(t_i, \bar{t}_j)$ and $\mathrm{mit}_\Xi(\bar{t}_i, t_j)$. In this case, the higher the value of $I_\Xi(\delta_i; \delta_j)$, the larger the quantities $\mathrm{mit}_\Xi(t_i, \bar{t}_j)$ and $\mathrm{mit}_\Xi(\bar{t}_i, t_j)$, and the stronger the indication that $t_i$ and $t_j$ are highly dependent under the state values $(1, 0)$ and $(0, 1)$ and that they should not co-occur in $\Xi$.

We can clarify our viewpoint by an example given in [17]. Let us consider Method 4 and suppose $\Xi = \{d_1, d_2, d_3\}$, $V_{d_1} = \{t_1, t_2, t_3, t_4, t_5\}$, $V_{d_2} = \{t_1, t_4, t_5, t_7\}$ and $V_{d_3} = \{t_4, t_7, t_8\}$. Then, we have: $n_\Xi(t_1) = 2$, $n_\Xi(t_2) = 1$ and $n_\Xi(t_1, t_2) = 1$; $n_\Xi(t_5) = 2$, $n_\Xi(t_7) = 2$ and $n_\Xi(t_5, t_7) = 1$. Details of the computation can be found in [17]. The positive value of $I_\Xi(\delta_1; \delta_2)$ is dominated by both quantities $\mathrm{mit}_\Xi(t_1, t_2)$ and $\mathrm{mit}_\Xi(\bar{t}_1, \bar{t}_2)$, and $t_1$ and $t_2$ are highly dependent for their co-occurrence and co-not-occurrence in the set $\Xi$; the positive value of $I_\Xi(\delta_5; \delta_7)$ is dominated by both $\mathrm{mit}_\Xi(t_5, \bar{t}_7)$ and $\mathrm{mit}_\Xi(\bar{t}_5, t_7)$, and $t_5$ and $t_7$ are highly dependent for their not co-occurring in the set $\Xi$. In addition, from this example, we can see that the term pairs $(t_1, t_2)$ and $(t_5, t_7)$ have the same expected mutual information but that the implications for the individual state values are entirely different: terms $t_1$ and $t_2$ provide information strongly supporting both their co-occurrence and their co-not-occurrence, whereas terms $t_5$ and $t_7$ provide information strongly supporting the occurrence of one but not the other. It should be stressed again that all five measures, the four MIT measures and the EMIM measure, may give us useful information, but each tells us about a different aspect of the dependence of terms; in particular, it is likely that $I_\Xi(\delta_i; \delta_j)$ tells us nothing about the dependence of terms for their co-occurrence.
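The document counts in this example can be reproduced with a short sketch. The counting functions follow the definitions of $n_\Xi(t)$ and $n_\Xi(t_i, t_j)$ directly. The sign check additionally assumes plain relative-frequency estimates, $\psi_\Xi(t) = n_\Xi(t)/|\Xi|$ and $\gamma_\Xi(t_i, t_j) = n_\Xi(t_i, t_j)/|\Xi|$; expressions (15) and (17) are not reproduced in this text and may differ (for instance by smoothing), so the sketch illustrates only the direction of the dependence, not the values reported in [17]. All function names are hypothetical.

```python
docs = {
    "d1": {"t1", "t2", "t3", "t4", "t5"},
    "d2": {"t1", "t4", "t5", "t7"},
    "d3": {"t4", "t7", "t8"},
}

def n(term):
    """Number of documents in which `term` occurs."""
    return sum(term in terms for terms in docs.values())

def n_pair(ti, tj):
    """Number of documents in which both terms occur."""
    return sum(ti in terms and tj in terms for terms in docs.values())

def cooccurrence_sign(ti, tj):
    """Sign of gamma - psi_i * psi_j under the assumed relative-frequency
    estimates, computed in integer arithmetic:
    n_ij / N - n_i * n_j / N^2 has the sign of n_ij * N - n_i * n_j."""
    size = len(docs)
    diff = n_pair(ti, tj) * size - n(ti) * n(tj)
    return (diff > 0) - (diff < 0)

sign_12 = cooccurrence_sign("t1", "t2")  # +1: co-occurrence is supported
sign_57 = cooccurrence_sign("t5", "t7")  # -1: co-occurrence is not supported
```

Under the stated assumption, the sign is positive for $(t_1, t_2)$ and negative for $(t_5, t_7)$, which agrees with the qualitative reading above of which MIT measures dominate each EMIM value.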
Sixth, it is worth mentioning that many studies use a formula $I(t_i; t_j)$ to estimate the mutual information of terms $t_i$ and $t_j$. It is 'equivalent' to the MIT measure for the state value $(\delta_i, \delta_j) = (1, 1)$ given in Definition 3.1, as we denote $t^{\delta} = t, \bar{t}$ when $\delta = 1, 0$, respectively. The expression $I(t_i; t_j)$ seems simpler than that of $\mathrm{mit}(t_i, t_j)$. However, we point out that, mathematically, $\mathrm{mit}(t_i, t_j)$ is more appropriate and clearer than $I(t_i; t_j)$ from, for instance, the viewpoint of the probability space: it is obvious that $P(\delta_i, \delta_j)$ is over $\Omega \times \Omega$, as each of its arguments satisfies $\delta \in \Omega = \{0, 1\}$, whereas $P(t_i, t_j)$ can easily be misread as being over $V \times V$, as each of its arguments has the domain $t \in V = \{t_1, t_2, ..., t_n\}$ (rather than $t \in \{0, 1\}$). Also, $I(\cdot\,; \cdot)$, when used to express EMIM, is a traditional mathematical symbol denoting the summation of four items (rather than only one) corresponding to the four state value pairs of each term pair.

Seventh, it is worth mentioning that there are five information measures widely used in the literature for computing term dependence (or relatedness): directed divergence [1], divergence [1], information radius [21], Jensen difference [22] and the expected mutual information (i.e., EMIM, which is regarded as a special case of directed divergence) [1]. The five measures, which are generally called information gain, are by now familiar to many researchers. A detailed account of the concept of the measures is given in [1], and an axiomatic characterization can be found in [23].
The five measures are examined in our series of studies. Study [19] develops the measurement of term relatedness using the information radius measure, demonstrates how the relatedness measures may deal with some basic concepts of applications, and summarizes, from a practical perspective, the important features of the information radius measure and how it differs from the first two information measures (directed divergence and divergence). Study [18] addresses the measurement of term relatedness based on the Jensen difference measure and points out that, when Shannon entropy is used, the Jensen difference measure is in fact the information radius measure, so that some formal methods proposed in many past studies in terms of these two measures are in principle the same. Study [15] proposes a method for estimating the probability distributions required in EMIM, and provides examples illustrating that the method may fail if the verification conditions are not satisfied. Study [17] reconsiders the emim measure, which is widely used in applications and is derived by simplifying EMIM under a binary assumption, and discusses some potential but important problems of applying it. Study [20] attempts to establish a unified theoretical framework for applying several information measures to the measurement of term discrimination information, defines relatedness measures according to the discrimination measures, and then discusses some potential problems arising from using the relatedness measures and suggests solutions. Finally, we would like to point out that the current study extends studies [15] and [17]: it focuses on the establishment of a general framework for constructing estimation functions in order to define the probability distributions required in EMIM for effectively distinguishing potentially dependent term pairs from many others.
As this paper concentrates on formal analysis and discussion, the reader interested in how the mutual information methods, as well as methods based on the other information measures, may be supported by empirical evidence drawn from a number of performance experiments is referred to the papers cited above.

VI. CONCLUSIONS
This study focused on the establishment of a general framework for defining the probability distributions required in EMIM, which is crucial and remains an open issue, for effectively distinguishing potentially dependent term pairs from many others. Under the framework,
- the general forms of estimation functions with a set of constraints were introduced;
- the general forms of probability distributions under term state values were defined;
- the general form of MIT measures for computing the mutual information of terms was formalised;
- the general properties of the MIT measures were studied and the general relations between the MIT measures were revealed.
Four estimation methods were proposed to clarify and illustrate the ideas presented in this study by
- interpreting the mathematical meanings of the estimation functions within practical application contexts;
- discussing verification conditions for satisfying the constraints in order to ensure that probability distributions meet the three criteria;
- presenting the properties and relationships of the MIT measures given in the individual methods.
Several key points of this study were emphasised, among them:
- The different implications of the dependence obtained from the individual MIT measures and the EMIM measure should be carefully distinguished from one another.
- The estimation functions should be constructed using weighting functions capable of capturing the occurrence and co-occurrence information of terms.
- Using the estimation functions to define probability distributions may fail if the constraints are not satisfied.
Under the general framework, the probability distributions, when defined from estimation functions satisfying the constraints, will meet the three criteria. Thus, the issue of defining the probability distributions becomes the issue of constructing the estimation functions and verifying the constraints, which is relatively simple for practical applications. Due to its generality, the framework is applicable to many areas of science that involve statistical semantic analysis of features (concepts, terms, phrases, words, etc.) and quantitative representations of objects (documents, abstracts, sentences, queries, etc.).