Abstract

There are various definitions of mutual information. Essentially, these definitions can be divided into two classes: (1) definitions with random variables and (2) definitions with ensembles. However, there are some mathematical flaws in these definitions. For instance, Class 1 definitions either neglect the probability spaces or assume the two random variables have the same probability space. Class 2 definitions redefine marginal probabilities from the joint probabilities. In fact, the marginal probabilities are given by the ensembles and should not be redefined from the joint probabilities. Both Class 1 and Class 2 definitions assume a joint distribution exists. Yet, they all ignore an important fact that the joint distribution or the joint probability measure is not unique. In this paper, we first present a new unified definition of mutual information to cover all the various definitions and to fix their mathematical flaws. Our idea is to define the joint distribution of two random variables by taking the marginal probabilities into consideration. Next, we establish some properties of the newly defined mutual information. We then propose a method to calculate mutual information in machine learning. Finally, we apply our newly defined mutual information to credit scoring.

1. Introduction

Mutual information has emerged in recent years as an important measure of statistical dependence. It has been used as a criterion for feature selection in engineering especially in machine learning (see [13] and references therein).

Mutual information is a concept rooted in information theory. Its predecessor, called the rate of transmission, was first introduced by Shannon in 1948 in a classical paper [4] for the communication system. Shannon first introduced a concept called entropy for a single discrete chance variable. He then defined the joint entropy and conditional entropy for two discrete chance variables using the joint distribution. Finally, he defined the rate of transmission as the difference between the entropy and conditional entropy. While Shannon did not define a chance variable in his paper, it is understood to be a synonym of a random variable.

Since Shannon’s pioneering work [4], there have been various definitions for mutual information. Essentially, these definitions can be divided into two classes: (1) definitions with random variables and (2) definitions with ensembles, that is, probability spaces in the mathematical literature.

Class 1 definitions of mutual information depend on the joint distribution of two random variables. More specifically, Kullback ([5], 1959) defined entropy, conditional entropy, and joint entropy using compact mathematical formulas. Pinsker ([6], 1960 and 1964) treated the fundamental concepts of Shannon in a more advanced manner by employing probability theory. His definition of mutual information was more general in that he implicitly assumed the two random variables had different probability spaces. Ash ([7], 1965) explicitly assumed the two random variables had the same probability space and followed Shannon’s way to define mutual information. Cover and Thomas ([8], 2006) defined mutual information in a simple way by avoiding mentioning probability spaces.

Class 2 definitions depend on the joint probability measure of the joint sample space of two ensembles. Among such definitions, Fano ([9], 1961), Abramson ([10], 1963), and Gallager ([11], 1968) developed their definitions in a similar way. They first defined the entropy of an ensemble, the conditional entropy, and the joint entropy of two ensembles. Next, they defined the mutual information of a joint event. Noting that the mutual information of a joint event is a random variable, they calculated the mean value of this random variable and called the result the mean information of the two ensembles.

However, there are some mathematical flaws in these various definitions of mutual information. Class 2 definitions redefine marginal probabilities from the joint probabilities. As a matter of fact, the marginal probabilities are given by the ensembles and hence should not be redefined from the joint probabilities. Moreover, except for Pinsker’s definition, Class 1 definitions either neglect the probability spaces or assume the two random variables have the same probability space. Both Class 1 definitions and Class 2 definitions assume a joint distribution or a joint probability measure exists. Yet, they all ignore an important fact that the joint distribution or the joint probability measure is not unique.

In this paper, we first present a unified definition for mutual information using random variables with different probability spaces. Our idea is to define the joint distribution of two random variables by taking the marginal probabilities into consideration. With our new definition of mutual information, different joint distributions will result in different mutual information. Next, we establish some properties of the newly defined mutual information. We then propose a method to calculate mutual information in machine learning. Finally, we apply our newly defined mutual information to credit scoring.

The rest of the paper is organized as follows. In Section 2, we briefly review the basic concepts in probability theory. In Section 3, we examine various definitions of mutual information. In Section 4, we first propose a new unified definition for mutual information and then establish some properties of the newly defined mutual information. In Section 5, we first propose a method to calculate mutual information in machine learning. We then apply the newly defined mutual information to credit scoring. The paper is concluded in Section 6.

Throughout the paper, we will restrict our focus on mutual information for finite discrete random variables.

2. Basic Concepts in Probability Theory

Let us review some basic concepts of probability theory. They can be found in many books in probability theory such as [12].

Definition 1. A probability space is a triple $(\Omega, \mathcal{F}, P)$, where

(1) $\Omega$ is a set, called a sample space, and elements of $\Omega$ are denoted by $\omega$ and are called outcomes;
(2) $\mathcal{F}$ is a $\sigma$-field consisting of subsets of $\Omega$, and elements of $\mathcal{F}$ are called events;
(3) $P$ is called a probability measure; it is a mapping from $\mathcal{F}$ to $[0, 1]$ with $P(\Omega) = 1$ such that
$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i) \quad \text{if } A_1, A_2, \ldots \in \mathcal{F} \text{ are pairwise disjoint.}$$

Definition 2. A discrete probability space is a probability space $(\Omega, \mathcal{F}, P)$ such that $\Omega$ is finite or countable: $\Omega = \{\omega_1, \omega_2, \ldots\}$. In this case, $\mathcal{F}$ is chosen to be all the subsets of $\Omega$, and the probability measure $P$ can be defined in terms of a series of nonnegative numbers $p_1, p_2, \ldots$ whose sum is 1. If $A$ is any subset of $\Omega$, then
$$P(A) = \sum_{\omega_i \in A} p_i.$$
In particular,
$$P(\{\omega_i\}) = p_i, \quad i = 1, 2, \ldots.$$
For simplicity, we will write $P(\{\omega_i\})$ as $P(\omega_i)$.

From Definition 2, we see that for a discrete probability space the probability measure $P$ is characterized by the pointwise mapping $\omega_i \mapsto p_i$. The probability of an event $A$ is computed simply by adding the probabilities of the individual points of $A$.

Definition 3. A random variable $X$ on a probability space $(\Omega, \mathcal{F}, P)$ is a Borel measurable function from $\Omega$ to $\mathbb{R}$ such that, for every Borel set $B \subset \mathbb{R}$, $X^{-1}(B) \in \mathcal{F}$. Here, we use the notation $X^{-1}(B) = \{\omega \in \Omega : X(\omega) \in B\}$.

Definition 4. If $X$ is a random variable, then for every Borel subset $B$ of $\mathbb{R}$, we define a function $P_X$ by
$$P_X(B) = P(X^{-1}(B)) = P(X \in B).$$
Then $P_X$ is a probability measure on $\mathbb{R}$ and is called the probability distribution of $X$.

Definition 5. A random variable $X$ is discrete if its range is finite or countable. In particular, any random variable on a discrete probability space is discrete, since $\Omega$ is countable.

Definition 6. A (discrete) random variable $X$ on a discrete probability space $(\Omega, \mathcal{F}, P)$ is a Borel measurable function from $\Omega$ to $\mathbb{R}$, where $\Omega = \{\omega_1, \omega_2, \ldots\}$ and $\mathbb{R}$ is the set of real numbers. If the range of $X$ is $\{x_1, x_2, \ldots\}$, then the function $p_X$ defined by
$$p_X(x_i) = P(X = x_i), \quad i = 1, 2, \ldots$$
is called the probability mass function of $X$, whereas the probabilities $P(X = x_i)$, $i = 1, 2, \ldots$, are called the probability distribution of $X$.

Note that, in Definition 2,
$$P(X = x_i) = \sum_{\omega_j : X(\omega_j) = x_i} p_j, \quad i = 1, 2, \ldots.$$
Thus, a discrete random variable may be characterized by its probability mass function.
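
To make Definitions 2–6 concrete, here is a minimal sketch (with made-up outcomes and probabilities, not taken from the paper) that computes the probability mass function of a discrete random variable by adding the point probabilities of the outcomes mapped to each value.

```python
# Illustrative sketch only: a finite discrete probability space and a random
# variable X on it; the pmf is obtained by summing the point probabilities of
# the outcomes that X maps to each value (the last display above).
from collections import defaultdict

# Hypothetical sample space and point probabilities (Definition 2).
omega = ["w1", "w2", "w3", "w4"]
p = {"w1": 0.1, "w2": 0.2, "w3": 0.3, "w4": 0.4}

# Hypothetical random variable X: a map from the sample space to the reals.
X = {"w1": 0.0, "w2": 1.0, "w3": 1.0, "w4": 2.0}

# Probability mass function p_X(x) = P(X = x).
pmf = defaultdict(float)
for w in omega:
    pmf[X[w]] += p[w]

print(dict(pmf))  # approximately {0.0: 0.1, 1.0: 0.5, 2.0: 0.4}
```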

3. Various Definitions of Mutual Information

Since Shannon’s pioneering work [4], there have been various definitions for mutual information. Essentially, these definitions can be divided into two classes: (1) definitions with random variables and (2) definitions with ensembles, that is, probability spaces in the mathematical literature.

3.1. Shannon’s Original Definition

Definition 7. Let $x$ be a chance variable with $n$ possible values occurring with probabilities $p_1, p_2, \ldots, p_n$, whose sum is 1. Then
$$H(x) = -\sum_{i=1}^{n} p_i \log p_i$$
is called the entropy of $x$.
Suppose two chance variables, $x$ and $y$, have $m$ and $n$ possibilities, respectively. Let indices $i$ and $j$ range over all the $m$ possibilities of $x$ and all the $n$ possibilities of $y$, respectively. Let $p(i)$ be the probability of $i$ and $p(i, j)$ the probability of the joint occurrence of $i$ and $j$. Denote the conditional probability of $j$ given $i$ by $p_i(j)$ and the conditional probability of $i$ given $j$ by $p_j(i)$.

Definition 8. The joint entropy of $x$ and $y$ is defined as
$$H(x, y) = -\sum_{i, j} p(i, j) \log p(i, j).$$

Definition 9. The conditional entropy of $y$, $H_x(y)$, is defined as
$$H_x(y) = -\sum_{i, j} p(i, j) \log p_i(j).$$
The conditional entropy of $x$, $H_y(x)$, can be defined similarly.

Then the following relation holds:
$$H(x, y) = H(x) + H_x(y).$$

Definition 10. The rate of transmission $R$ of information is defined as the difference between $H(x)$ and $H_y(x)$; that is, $R = H(x) - H_y(x)$. Then $R$ can be written in two other forms:
$$R = H(y) - H_x(y) = H(x) + H(y) - H(x, y).$$

Remark 11. Shannon did not derive the explicit formula for $R$:
$$R = \sum_{i, j} p(i, j) \log \frac{p(i, j)}{p(i) p(j)}.$$
However, he did imply it in Appendix 7 of [4].
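
As a small numerical illustration of Definitions 7–10 (the joint probabilities below are made up, not from the paper), the following sketch computes the entropies and the rate of transmission and checks that $R = H(x) + H(y) - H(x, y)$.

```python
# Illustrative sketch of Definitions 7-10: entropies and the rate of transmission
# computed from a hypothetical joint distribution p(i, j); logarithms are base 2.
import math

def H(probs, base=2):
    """Entropy of a list of probabilities; terms with probability 0 are skipped."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# Hypothetical joint probabilities p(i, j) for a 2 x 3 example (entries sum to 1).
p_joint = [[0.10, 0.20, 0.10],
           [0.25, 0.05, 0.30]]

# In Shannon's setting the joint is given; the single-variable probabilities follow.
p_x = [sum(row) for row in p_joint]                       # p(i)
p_y = [sum(row[j] for row in p_joint) for j in range(3)]  # p(j)

H_x, H_y = H(p_x), H(p_y)
H_xy = H([pij for row in p_joint for pij in row])
H_y_given_x = H_xy - H_x            # H_x(y) = H(x, y) - H(x)
R = H_y - H_y_given_x               # rate of transmission
print(R, H_x + H_y - H_xy)          # the two forms agree
```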

3.2. Class 1 Definitions
3.2.1. Kullback’s Definition

Kullback [5] redefined entropy more mathematically in a standalone homework question as follows. Consider two discrete random variables $X$ and $Y$, where $X$ takes values $x_1, \ldots, x_m$ and $Y$ takes values $y_1, \ldots, y_n$ with joint probabilities $p(x_i, y_j) = P(X = x_i, Y = y_j)$. Define the joint entropy, entropy, and conditional entropy as follows:
$$H(X, Y) = -\sum_{i=1}^{m} \sum_{j=1}^{n} p(x_i, y_j) \log p(x_i, y_j),$$
$$H(X) = -\sum_{i=1}^{m} p(x_i) \log p(x_i), \qquad H(Y \mid X) = -\sum_{i=1}^{m} \sum_{j=1}^{n} p(x_i, y_j) \log p(y_j \mid x_i).$$
Then $H(X, Y) = H(X) + H(Y \mid X)$ and $H(X, Y) = H(Y) + H(X \mid Y)$.

3.2.2. Information Conveyed

Ash [7] began with two random variables $X$ and $Y$ and assumed $X$ and $Y$ had the same probability space. He systematically defined the entropy, conditional entropy, and joint entropy following Shannon’s path in [4]. At the end, he denoted $H(X) - H(X \mid Y)$ by $I(X \mid Y)$ and called it the information conveyed about $X$ by $Y$.

3.2.3. Information of One Variable with respect to the Other

Pinsker [6] treated the fundamental concepts of Shannon in a more advanced manner by employing probability theory. Suppose $\xi$ is a random variable defined on a probability space $(\Omega_1, \mathcal{F}_1, P_1)$ and taking values in a measurable space $(X, S_X)$, and $\eta$ is a random variable defined on a probability space $(\Omega_2, \mathcal{F}_2, P_2)$ and taking values in a measurable space $(Y, S_Y)$. Then, the pair $(\xi, \eta)$ of random variables may be regarded as a single random variable with values in the product space $X \times Y$ of all pairs $(x, y)$ with $x \in X$, $y \in Y$. The distribution $P_{\xi\eta}$ of $(\xi, \eta)$ is called the joint distribution of the random variables $\xi$ and $\eta$. By the product of the distributions $P_\xi$ and $P_\eta$, denoted by $P_\xi \times P_\eta$, we mean the distribution defined on $X \times Y$ by $(P_\xi \times P_\eta)(A \times B) = P_\xi(A) P_\eta(B)$ for $A \in S_X$ and $B \in S_Y$. If the joint distribution $P_{\xi\eta}$ coincides with the product distribution $P_\xi \times P_\eta$, the random variables $\xi$ and $\eta$ are said to be independent. If $\xi$ and $\eta$ are discrete random variables, say, $X$ and $Y$ contain countably many points $x_1, x_2, \ldots$ and $y_1, y_2, \ldots$, then
$$I(\xi, \eta) = \sum_{i, j} P_{\xi\eta}(x_i, y_j) \log \frac{P_{\xi\eta}(x_i, y_j)}{P_\xi(x_i) P_\eta(y_j)}$$
is called the information of $\xi$ and $\eta$ with respect to one another.

3.2.4. A Modern Definition in Information Theory

Of the various definitions of mutual information, the most widely accepted in recent years is the one by Cover and Thomas [8].

Let $X$ be a discrete random variable with alphabet $\mathcal{X}$ and probability mass function $p(x) = P(X = x)$, $x \in \mathcal{X}$. Let $Y$ be a discrete random variable with alphabet $\mathcal{Y}$ and probability mass function $p(y) = P(Y = y)$, $y \in \mathcal{Y}$. Suppose $X$ and $Y$ have a joint mass function (joint distribution) $p(x, y)$. Then the mutual information $I(X; Y)$ can be defined as
$$I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}.$$
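
A minimal sketch of this definition (illustrative code, not from [8]; the joint and marginal pmfs below are hypothetical) is the following.

```python
# Mutual information from a given joint pmf p(x, y) and given marginal pmfs p(x), p(y).
# Terms with p(x, y) = 0 are skipped, consistent with the convention 0 log 0 = 0.
import math

def mutual_information(p_xy, p_x, p_y, base=2):
    """p_xy[i][j] = p(x_i, y_j); p_x[i] = p(x_i); p_y[j] = p(y_j)."""
    I = 0.0
    for i, row in enumerate(p_xy):
        for j, pij in enumerate(row):
            if pij > 0:
                I += pij * math.log(pij / (p_x[i] * p_y[j]), base)
    return I

# Hypothetical example: a 2 x 2 joint pmf and its marginals.
p_xy = [[0.4, 0.1],
        [0.1, 0.4]]
p_x = [0.5, 0.5]
p_y = [0.5, 0.5]
print(mutual_information(p_xy, p_x, p_y))  # about 0.278 bits
```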

3.3. Class 2 Definitions

In Class 2 definitions, random variables are replaced by ensembles, and the mutual information is the so-called average mutual information. Gallager [11] adopted a more general and more rigorous approach to introduce the concept of mutual information in communication theory. Indeed, he combined and compiled the results from Fano [9] and Abramson [10].

Suppose that a discrete ensemble $X$ has a sample space $\{a_1, a_2, \ldots, a_K\}$ and a discrete ensemble $Y$ has a sample space $\{b_1, b_2, \ldots, b_J\}$. Consider the joint sample space of all pairs $(a_k, b_j)$, $1 \le k \le K$, $1 \le j \le J$. A probability measure on the joint sample space is given by the joint probability $P_{XY}(a_k, b_j)$, defined for $1 \le k \le K$, $1 \le j \le J$. The combination of a joint sample space and a probability measure for outcomes $a_k$ and $b_j$ is called a joint ensemble. Then the marginal probabilities can be found as
$$P_X(a_k) = \sum_{j=1}^{J} P_{XY}(a_k, b_j).$$
In more abbreviated notation, this is written as
$$P(a_k) = \sum_{j} P(a_k, b_j).$$
Likewise,
$$P_Y(b_j) = \sum_{k=1}^{K} P_{XY}(a_k, b_j);$$
in more abbreviated notation, this is written as
$$P(b_j) = \sum_{k} P(a_k, b_j).$$
If $P(b_j) > 0$, the conditional probability that the outcome of $X$ is $a_k$, given that the outcome of $Y$ is $b_j$, is defined as
$$P(a_k \mid b_j) = \frac{P(a_k, b_j)}{P(b_j)}.$$
The mutual information between the events $X = a_k$ and $Y = b_j$ is defined as
$$I(a_k; b_j) = \log \frac{P(a_k \mid b_j)}{P(a_k)} = \log \frac{P(a_k, b_j)}{P(a_k) P(b_j)}.$$
Since the mutual information defined above is a random variable on the joint ensemble, its mean value, which is called the average mutual information and is denoted by $I(X; Y)$, is given by
$$I(X; Y) = \sum_{k=1}^{K} \sum_{j=1}^{J} P(a_k, b_j) \log \frac{P(a_k, b_j)}{P(a_k) P(b_j)}.$$

Remark 12. By means of an information channel consisting of a transmitter with alphabet $A$ of elements $a_i$, $i = 1, \ldots, r$, and a receiver with alphabet $B$ of elements $b_j$, $j = 1, \ldots, s$, Abramson [10] denoted the average mutual information by $I(A; B)$ and called it the mutual information of $A$ and $B$.

The mutual information between two continuous random variables $X$ and $Y$ [8] (also called the rate of transmission in [1]) is defined as
$$I(X; Y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y) \log \frac{f(x, y)}{f_X(x) f_Y(y)} \, dx \, dy,$$
where $f(x, y)$ is the joint probability density function of $X$ and $Y$ and $f_X(x)$ and $f_Y(y)$ are the marginal density functions associated with $X$ and $Y$, respectively. The mutual information between two continuous random variables is also called the differential mutual information.

However, the differential mutual information is much less popular than its discrete counterpart. On the one hand, the joint density function involved is unknown in most cases and hence must be estimated [13, 14]. On the other hand, data in engineering and machine learning are mostly finite, and so mutual information between discrete random variables is used.

4. A New Unified Definition of Mutual Information

In Section 3, we reviewed various definitions of mutual information. Shannon’s original definition laid the foundation of information theory. Kullback’s definition used random variables for the first time and was more mathematical and more compact. Although Ash’s definition followed Shannon’s path, it was more systematic. Pinsker’s definition was most mathematical in that it employed probability theory. Gallager’s definition was more general and more rigorous in communication theory. Cover and Thomas’s definition is so succinct that it is now a standard definition in information theory.

However, there are some mathematical flaws in these various definitions of mutual information. Class 2 definitions redefine marginal probabilities from the joint probabilities. As a matter of fact, the marginal probabilities are given by the ensembles and hence should not be redefined from the joint probabilities. Except for Pinsker’s definition, Class 1 definitions either neglect the probability spaces or assume the two random variables have the same probability space. Both Class 1 definitions and Class 2 definitions assume a joint distribution or a joint probability measure exists. Yet, they all ignore an important fact that the joint distribution or the joint probability measure is not unique.

4.1. Unified Definition of Mutual Information

Let $X$ be a finite discrete random variable on a discrete probability space $(\Omega_1, \mathcal{F}_1, P_1)$ with $\Omega_1 = \{\omega_1, \omega_2, \ldots\}$ and range $\{x_1, x_2, \ldots, x_m\}$ with $p_i = P(X = x_i)$, $i = 1, \ldots, m$. Let $Y$ be a finite discrete random variable on a discrete probability space $(\Omega_2, \mathcal{F}_2, P_2)$ with $\Omega_2 = \{\omega'_1, \omega'_2, \ldots\}$ and range $\{y_1, y_2, \ldots, y_n\}$ with $q_j = P(Y = y_j)$, $j = 1, \ldots, n$.

If $X$ and $Y$ have the same probability space $(\Omega, \mathcal{F}, P)$, then the joint distribution is simply
$$P(X = x_i, Y = y_j) = P(\{\omega \in \Omega : X(\omega) = x_i \text{ and } Y(\omega) = y_j\}).$$
However, when $X$ and $Y$ have different probability spaces and hence different probability measures, the joint distribution is more complicated.

Definition 13. The joint sample space of random variables $X$ and $Y$ is defined as the product $\Omega_1 \times \Omega_2$ of all pairs $(\omega, \omega')$, $\omega \in \Omega_1$ and $\omega' \in \Omega_2$. The joint $\sigma$-field $\mathcal{F}_1 \times \mathcal{F}_2$ is defined as the $\sigma$-field generated by all products $A \times B$, where $A$ and $B$ are elements of $\mathcal{F}_1$ and $\mathcal{F}_2$, respectively. A joint probability measure $P_{XY}$ of $X$ and $Y$ is a probability measure on $\mathcal{F}_1 \times \mathcal{F}_2$ such that
$$P_{XY}(A \times \Omega_2) = P_1(A), \qquad P_{XY}(\Omega_1 \times B) = P_2(B)$$
for any $A \in \mathcal{F}_1$ and $B \in \mathcal{F}_2$. $(\Omega_1 \times \Omega_2, \mathcal{F}_1 \times \mathcal{F}_2, P_{XY})$ is called the joint probability space of $X$ and $Y$, and the probabilities
$$P(X = x_i, Y = y_j) = P_{XY}\left(\{(\omega, \omega') : X(\omega) = x_i,\ Y(\omega') = y_j\}\right),$$
$i = 1, \ldots, m$, $j = 1, \ldots, n$, are called the joint distribution of $X$ and $Y$.

Combining Definitions 2 and 13, we immediately obtain the following results.

Proposition 14. A set of nonnegative numbers $p_{ij}$, $i = 1, \ldots, m$, $j = 1, \ldots, n$, whose sum is 1 can serve as a probability measure on the set of all pairs $(x_i, y_j)$. The probability of any event is computed simply by adding the probabilities of the individual points of the event. If, in addition, for $i = 1, \ldots, m$ and $j = 1, \ldots, n$, the following hold:
$$\sum_{j=1}^{n} p_{ij} = P(X = x_i) = p_i, \qquad \sum_{i=1}^{m} p_{ij} = P(Y = y_j) = q_j,$$
then $\{p_{ij}\}$ is a joint distribution of $X$ and $Y$.

For convenience, from now on we will shorten $P(X = x_i, Y = y_j)$ as $p_{ij}$.

This two-dimensional measure $P_{XY}$ should not be confused with the one-dimensional joint distribution obtained when $X$ and $Y$ have the same probability space.

Remark 15. If $\Omega_1 = \Omega_2 = \Omega$ and $P_1 = P_2 = P$, instead of using the two-dimensional measure $P_{XY}$ on $\Omega \times \Omega$, we may use the one-dimensional measure $P$ on $\Omega$ and set $P(X = x_i, Y = y_j) = P(\{\omega : X(\omega) = x_i \text{ and } Y(\omega) = y_j\})$. Then, the marginal conditions in Proposition 14 always hold. In this sense, our new definition of joint distribution reduces to the definition of joint distribution with the same probability space.

Definition 16. The conditional probability of $X = x_i$, given $Y = y_j$ with $q_j > 0$, is defined as
$$P(X = x_i \mid Y = y_j) = \frac{P(X = x_i, Y = y_j)}{P(Y = y_j)} = \frac{p_{ij}}{q_j}.$$

Theorem 17. For any two discrete random variables, there is at least one joint probability measure called the product probability measure or simply product distribution.

Proof. Let random variables $X$ and $Y$ be defined as before. Define a function $P_{XY}$ from $\Omega_1 \times \Omega_2$ to $[0, 1]$ as follows:
$$P_{XY}(\omega, \omega') = P_1(\omega) P_2(\omega'), \quad \omega \in \Omega_1,\ \omega' \in \Omega_2.$$
Then
$$\sum_{\omega \in \Omega_1} \sum_{\omega' \in \Omega_2} P_{XY}(\omega, \omega') = \sum_{\omega \in \Omega_1} P_1(\omega) \sum_{\omega' \in \Omega_2} P_2(\omega') = 1.$$
Hence, $P_{XY}$ can serve as a probability measure on $\Omega_1 \times \Omega_2$ by Definition 2. The probability of any event is computed simply by adding the probabilities of the individual points of the event. Moreover, for any $A \in \mathcal{F}_1$ with elements $\omega_{i_1}, \omega_{i_2}, \ldots$,
$$P_{XY}(A \times \Omega_2) = \sum_{k} \sum_{\omega' \in \Omega_2} P_1(\omega_{i_k}) P_2(\omega') = \sum_{k} P_1(\omega_{i_k}) = P_1(A).$$
Similarly, $P_{XY}(\Omega_1 \times B) = P_2(B)$ for any $B \in \mathcal{F}_2$. Hence, $P_{XY}$ is a joint probability measure of $X$ and $Y$ by Definition 13.
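
The construction in the proof is easy to check numerically. The sketch below (with made-up marginal distributions $P_1$ and $P_2$ on two different sample spaces) forms the product measure and verifies that it sums to 1 and reproduces both marginals.

```python
# Illustrative check of Theorem 17: the product measure P_XY(w, w') = P1(w) * P2(w')
# sums to 1 and returns both marginals, as required by Definition 13.
p1 = [0.2, 0.3, 0.5]   # hypothetical point probabilities on Omega_1
p2 = [0.6, 0.4]        # hypothetical point probabilities on Omega_2 (a different space)

product = [[a * b for b in p2] for a in p1]

total = sum(sum(row) for row in product)
row_sums = [sum(row) for row in product]                        # should recover p1
col_sums = [sum(row[j] for row in product) for j in range(2)]   # should recover p2
print(total)               # 1.0 (up to floating-point rounding)
print(row_sums, col_sums)  # [0.2, 0.3, 0.5] and [0.6, 0.4] (up to rounding)
```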

Definition 18. Random variables $X$ and $Y$ are said to be independent under a joint distribution $\{p_{ij}\}$ if $\{p_{ij}\}$ coincides with the product distribution; that is,
$$p_{ij} = P(X = x_i) P(Y = y_j) = p_i q_j \quad \text{for all } i, j.$$

Definition 19. The joint entropy is defined as
$$H(X, Y) = -\sum_{i=1}^{m} \sum_{j=1}^{n} p_{ij} \log p_{ij}.$$

Definition 20. The conditional entropy is as follows:
$$H(Y \mid X) = -\sum_{i=1}^{m} \sum_{j=1}^{n} p_{ij} \log P(Y = y_j \mid X = x_i) = -\sum_{i=1}^{m} \sum_{j=1}^{n} p_{ij} \log \frac{p_{ij}}{p_i}.$$
The conditional entropy $H(X \mid Y)$ is defined similarly.

Definition 21. The mutual information between $X$ and $Y$ is defined as
$$I(X; Y) = \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ij} \log \frac{p_{ij}}{p_i q_j}.$$

As with other measures in information theory, the base of the logarithm in the definition of $I(X; Y)$ is left unspecified. Indeed, $I(X; Y)$ under one base is proportional to that under another base by the change-of-base formula. Moreover, we take $0 \log 0$ to be 0. This corresponds to the limit of $t \log t$ as $t$ goes to 0.

It is obvious that our new definition covers Class 2 definitions. It also covers Class 1 definitions by the following argument. Let $\Omega_1 = \{x_1, \ldots, x_m\}$ and $\Omega_2 = \{y_1, \ldots, y_n\}$ with $P_1(x_i) = p_i$ and $P_2(y_j) = q_j$. Define random variables $X$ and $Y$ as the one-to-one mappings
$$X(x_i) = x_i, \quad i = 1, \ldots, m, \qquad Y(y_j) = y_j, \quad j = 1, \ldots, n.$$
Then we have
$$P(X = x_i) = P_1(x_i) = p_i, \qquad P(Y = y_j) = P_2(y_j) = q_j,$$
and any joint distribution assumed in a Class 1 definition satisfies the conditions of Proposition 14, so it is a joint distribution in the sense of Definition 13 and yields the same mutual information.

It is worth noting that our new definition of mutual information has some advantages over the various existing definitions. For instance, it can easily be used to do feature selection, as seen later. In addition, our new definition leads to different values of mutual information for different joint distributions, as demonstrated in the following example.

Example 22. Assume random variables $X$ and $Y$ have given marginal probability distributions $p_1, \ldots, p_m$ and $q_1, \ldots, q_n$. We can generate four different joint probability distributions, each satisfying the marginal conditions in Proposition 14, that lead to four different values of mutual information. However, under all the existing definitions, a joint distribution must be given in order to find the mutual information.
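
The following sketch illustrates the same phenomenon with made-up marginals (they are not the distributions of Example 22): several joint distributions satisfy the marginal conditions of Proposition 14 yet give different values of mutual information.

```python
# Illustrative sketch: joint distributions with identical marginals but different I(X; Y).
import math

def mi(joint, p, q):
    return sum(joint[i][j] * math.log2(joint[i][j] / (p[i] * q[j]))
               for i in range(len(p)) for j in range(len(q)) if joint[i][j] > 0)

p = [0.5, 0.5]   # hypothetical marginal distribution of X
q = [0.5, 0.5]   # hypothetical marginal distribution of Y

independent = [[0.25, 0.25], [0.25, 0.25]]   # product distribution, I = 0
comonotone  = [[0.50, 0.00], [0.00, 0.50]]   # X determines Y, I = 1 bit
# Convex combinations of the two still have marginals p and q.
mixtures = [[[(1 - t) * independent[i][j] + t * comonotone[i][j] for j in range(2)]
             for i in range(2)] for t in (0.0, 0.4, 0.8, 1.0)]

for joint in mixtures:
    print(round(mi(joint, p, q), 4))   # four different values, ranging from 0.0 to 1.0
```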

4.2. Properties of Newly Defined Mutual Information

Before we discuss some properties of mutual information, we first introduce the Kullback-Leibler distance [8].

Definition 23. The relative entropy or Kullback-Leibler distance between two discrete probability distributions $u = (u_1, \ldots, u_K)$ and $v = (v_1, \ldots, v_K)$ is defined as
$$D(u \| v) = \sum_{k=1}^{K} u_k \log \frac{u_k}{v_k}.$$

Lemma 24 (see [8]). Let $u$ and $v$ be two discrete probability distributions. Then $D(u \| v) \ge 0$, with equality if and only if $u_k = v_k$ for all $k$.

Remark 25. The Kullback-Leibler distance is not a true distance between distributions since it is not symmetric and does not satisfy the triangle inequality either. Nevertheless, it is often useful to think of relative entropy as a “distance” between distributions.

The following property shows that the mutual information under a joint probability measure is the Kullback-Leibler distance between the joint distribution $\{p_{ij}\}$ and the product distribution $\{p_i q_j\}$.

Property 1. The mutual information $I(X; Y)$ of random variables $X$ and $Y$ is the Kullback-Leibler distance between the joint distribution $\{p_{ij}\}$ and the product distribution $\{p_i q_j\}$.

Proof. Using a mapping $k = k(i, j)$ from the two-dimensional indices $(i, j)$ to a one-dimensional index $k$, and using another mapping from the one-dimensional index back to the two-dimensional indices, we rewrite the joint distribution and the product distribution as one-dimensional distributions $u$ and $v$ with $u_{k(i, j)} = p_{ij}$ and $v_{k(i, j)} = p_i q_j$. Since
$$\sum_{k=1}^{mn} u_k = \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ij} = 1, \qquad \sum_{k=1}^{mn} v_k = \sum_{i=1}^{m} p_i \sum_{j=1}^{n} q_j = 1,$$
we obtain
$$I(X; Y) = \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ij} \log \frac{p_{ij}}{p_i q_j} = \sum_{k=1}^{mn} u_k \log \frac{u_k}{v_k} = D(u \| v).$$
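
Property 1 is also easy to verify numerically. The sketch below (with a hypothetical joint distribution) flattens the joint and the product distribution with the same index mapping and checks that the mutual information equals $D(u \| v)$.

```python
# Numerical check of Property 1: I(X; Y) equals the Kullback-Leibler distance between
# the (flattened) joint distribution and the (flattened) product distribution.
import math

def kl(u, v):
    """D(u || v) for two discrete distributions given as equal-length lists."""
    return sum(ui * math.log2(ui / vi) for ui, vi in zip(u, v) if ui > 0)

# Hypothetical joint distribution (rows: values of X, columns: values of Y).
joint = [[0.30, 0.10],
         [0.20, 0.40]]
p = [sum(row) for row in joint]                       # marginal distribution of X
q = [sum(row[j] for row in joint) for j in range(2)]  # marginal distribution of Y

# Flatten the joint and the product distribution with the same index mapping.
u = [joint[i][j] for i in range(2) for j in range(2)]
v = [p[i] * q[j] for i in range(2) for j in range(2)]

I = sum(joint[i][j] * math.log2(joint[i][j] / (p[i] * q[j]))
        for i in range(2) for j in range(2) if joint[i][j] > 0)
print(I, kl(u, v))   # the two numbers agree
```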

Property 2. Let $X$ and $Y$ be two discrete random variables. The mutual information between $X$ and $Y$ satisfies $I(X; Y) \ge 0$, with equality if and only if $X$ and $Y$ are independent.

Proof. Let us use the mappings between the two-dimensional indices and the one-dimensional index in the proof of Property 1. By Lemma 24, $I(X; Y) = D(u \| v) \ge 0$, with equality if and only if $u_k = v_k$ for $k = 1, \ldots, mn$, that is, $p_{ij} = p_i q_j$ for $i = 1, \ldots, m$ and $j = 1, \ldots, n$, or $X$ and $Y$ are independent.

Corollary 26. If $X$ is a constant random variable, that is, $P(X = c) = 1$ for some constant $c$, then, for any random variable $Y$, $I(X; Y) = 0$.

Proof. Suppose the range of $X$ is a constant $c$ and the sample space $\Omega_1$ has only one point $\omega_1$. Then, $P(X = c) = P_1(\omega_1) = 1$. For any joint probability measure $P_{XY}$ and any $j$,
$$P(X = c, Y = y_j) = P_{XY}(\Omega_1 \times Y^{-1}(y_j)) = P_2(Y^{-1}(y_j)) = P(Y = y_j) = P(X = c) P(Y = y_j).$$
Thus, $X$ and $Y$ are independent. By Property 2, $I(X; Y) = 0$.

Lemma 27 (see [8]). Let $X$ be a discrete random variable with $m$ values. Then $H(X) \le \log m$, with equality if and only if the $m$ values are equally probable.

Property 3. Let $X$ and $Y$ be two discrete random variables. Then the following relationships among mutual information, entropy, and conditional entropy hold:
$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y).$$

Proof. Consider
$$I(X; Y) = \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ij} \log \frac{p_{ij}}{p_i q_j} = \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ij} \log \frac{p_{ij}}{q_j} - \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ij} \log p_i = -H(X \mid Y) - \sum_{i=1}^{m} p_i \log p_i = H(X) - H(X \mid Y),$$
where the marginal conditions $\sum_{j=1}^{n} p_{ij} = p_i$ and $\sum_{i=1}^{m} p_{ij} = q_j$ have been used. The other two equalities are proved similarly.

Combining the above properties and noting that $H(X \mid Y)$ and $H(Y \mid X)$ are both nonnegative, we obtain the following properties.

Property 4. Let $X$ and $Y$ be two discrete random variables with $m$ and $n$ values, respectively. Then
$$0 \le I(X; Y) \le \min\{H(X), H(Y)\} \le \min\{\log m, \log n\}.$$
Moreover, $I(X; Y) = 0$ if and only if $X$ and $Y$ are independent.

5. Newly Defined Mutual Information in Machine Learning

Machine learning is the science of getting machines (computers) to automatically learn from data. In a typical learning setting, a training set $D$ contains $N$ examples (also known as samples, observations, or records) from an input space of features $X_1, \ldots, X_d$ and their associated output values from an output space $Y$ (i.e., the dependent variable). Here, $X_1, \ldots, X_d$ are called features, that is, independent variables. Hence, $D$ can be expressed as
$$D = \{(x_{k1}, x_{k2}, \ldots, x_{kd}, y_k) : k = 1, \ldots, N\},$$
where feature $X_i$ has values $x_{ki}$ for $k = 1, \ldots, N$.

A fundamental objective in machine learning is to find a functional relationship between the input $(X_1, \ldots, X_d)$ and the output $Y$. In general, there are a very large number of features, many of which are not needed. Sometimes, the output $Y$ is not determined by the complete set of the input features $X_1, \ldots, X_d$. Rather, it is decided by only a subset of them. This kind of reduction is called feature selection. Its purpose is to choose a subset of features to capture the relevant information. An easy and natural way for feature selection is as follows.

(1) Evaluate the relationship between each individual input feature $X_i$ and the output $Y$.
(2) Select the best set of features according to some criterion.

5.1. Calculation of Newly Defined Mutual Information

Since mutual information measures the dependency between random variables, we may use it to do feature selection in machine learning. Let us calculate the mutual information between an input feature $X$ and the output $Y$. Assume $X$ has $r$ different values $v_1, \ldots, v_r$. If $X$ has missing values, we will use one of the $r$ values to represent all the missing values. Assume $Y$ has $c$ different values $u_1, \ldots, u_c$.

Let us build a two-way frequency or contingency table (Table 1) by making $X$ the row variable and $Y$ the column variable, as in [8]. Let $f_{ij}$ be the frequency (which could be 0) of the pair $(v_i, u_j)$ for $i = 1$ to $r$ and $j = 1$ to $c$. Let the row and column marginal totals be $f_{i \cdot}$ and $f_{\cdot j}$, respectively. Then
$$f_{i \cdot} = \sum_{j=1}^{c} f_{ij}, \qquad f_{\cdot j} = \sum_{i=1}^{r} f_{ij}, \qquad \sum_{i=1}^{r} \sum_{j=1}^{c} f_{ij} = N.$$
Let us denote the relative frequency $f_{ij}/N$ by $p_{ij}$ and the marginal relative frequencies $f_{i \cdot}/N$ and $f_{\cdot j}/N$ by $p_{i \cdot}$ and $p_{\cdot j}$, respectively. We have the two-way relative frequency table; see Table 2.

Since
$$\sum_{i=1}^{r} \sum_{j=1}^{c} p_{ij} = 1, \qquad \sum_{i=1}^{r} p_{i \cdot} = 1, \qquad \sum_{j=1}^{c} p_{\cdot j} = 1,$$
$\{p_{ij}\}$, $\{p_{i \cdot}\}$, and $\{p_{\cdot j}\}$ can each serve as a probability measure.

Now we can define random variables for $X$ and $Y$ as follows. For convenience, we will use the same names $X$ and $Y$ for the random variables. Define $X$ on the discrete probability space with sample space $\{v_1, \ldots, v_r\}$ and point probabilities $P_1(v_i) = p_{i \cdot}$ by $X(v_i) = a_i$ for $i = 1, \ldots, r$. Note that $a_1, \ldots, a_r$ could be any real numbers as long as they are distinct, to guarantee that $X$ is a one-to-one mapping. In this case, $P(X = a_i) = P_1(v_i) = p_{i \cdot}$.

Similarly, define $Y$ on the discrete probability space with sample space $\{u_1, \ldots, u_c\}$ and point probabilities $P_2(u_j) = p_{\cdot j}$ by $Y(u_j) = b_j$ for $j = 1, \ldots, c$. Also, $b_1, \ldots, b_c$ could be any real numbers as long as they are distinct, to guarantee that $Y$ is a one-to-one mapping. In this case, $P(Y = b_j) = P_2(u_j) = p_{\cdot j}$.

Now define a mapping $P_{XY}$ from $\{v_1, \ldots, v_r\} \times \{u_1, \ldots, u_c\}$ to $[0, 1]$ as follows:
$$P_{XY}(v_i, u_j) = p_{ij} = \frac{f_{ij}}{N}, \quad i = 1, \ldots, r,\ j = 1, \ldots, c.$$
Since
$$\sum_{j=1}^{c} p_{ij} = \frac{f_{i \cdot}}{N} = p_{i \cdot} = P(X = a_i), \qquad \sum_{i=1}^{r} p_{ij} = \frac{f_{\cdot j}}{N} = p_{\cdot j} = P(Y = b_j),$$
$P_{XY}$ is a joint probability measure by Proposition 14. Finally, we can calculate the mutual information as follows:
$$I(X; Y) = \sum_{i=1}^{r} \sum_{j=1}^{c} p_{ij} \log \frac{p_{ij}}{p_{i \cdot}\, p_{\cdot j}} = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{f_{ij}}{N} \log \frac{N f_{ij}}{f_{i \cdot} f_{\cdot j}}.$$
It follows from Corollary 26 that if $X$ has only one value, then $I(X; Y) = 0$. On the other hand, if $X$ has all distinct values, the following result shows that the mutual information will reach its maximum value.
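
A minimal sketch of this calculation (an illustrative implementation, not the author's code; the toy data are hypothetical) builds the contingency table from raw feature and output values and evaluates $I(X; Y)$ from the relative frequencies.

```python
# Mutual information of a feature X and the output Y from their contingency table,
# following Section 5.1: p_ij = f_ij / N, with marginal relative frequencies as marginals.
import math
from collections import Counter

def mutual_information_from_data(x_values, y_values, base=2):
    N = len(x_values)
    joint = Counter(zip(x_values, y_values))   # cell frequencies f_ij (zero cells absent)
    fx = Counter(x_values)                     # row totals f_i.
    fy = Counter(y_values)                     # column totals f_.j
    I = 0.0
    for (xi, yj), f in joint.items():
        I += (f / N) * math.log(f * N / (fx[xi] * fy[yj]), base)
    return I

# Hypothetical toy data: a categorical feature and a binary good/bad outcome.
x = ["A", "A", "B", "B", "B", "C", "C", "C", "C", "C"]
y = [ 0,   0,   0,   1,   1,   1,   1,   1,   0,   1 ]
print(mutual_information_from_data(x, y))
```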

Proposition 28. If all the values of $X$ are distinct, then $I(X; Y) = H(Y)$.

Proof. If all the values of $X$ are distinct, then the number of different values of $X$ equals the number of observations; that is, $r = N$. From Tables 1 and 2, we observe the following:

(1) $f_{ij} = 0$ or 1 for all $i$ and $j$;
(2) $p_{ij} = 0$ or $1/N$ for all $i$ and $j$;
(3) for each $j$, since $\sum_{i=1}^{r} f_{ij} = f_{\cdot j}$, there are $f_{\cdot j}$ nonzero $f_{ij}$'s or, equivalently, $f_{\cdot j}$ nonzero $p_{ij}$'s;
(4) $f_{i \cdot} = 1$ and $p_{i \cdot} = 1/N$ for all $i$.

Using the above observations and the fact that $0 \log 0 = 0$, we have
$$I(X; Y) = \sum_{i=1}^{r} \sum_{j=1}^{c} p_{ij} \log \frac{p_{ij}}{p_{i \cdot} p_{\cdot j}} = \sum_{j=1}^{c} f_{\cdot j} \cdot \frac{1}{N} \log \frac{1/N}{(1/N)(f_{\cdot j}/N)} = -\sum_{j=1}^{c} \frac{f_{\cdot j}}{N} \log \frac{f_{\cdot j}}{N} = H(Y).$$

5.2. Applications of Newly Defined Mutual Information in Credit Scoring

Credit scoring is used to describe the process of evaluating the risk a customer poses of defaulting on a financial obligation [15–19]. The objective is to assign customers to one of two groups: good and bad. Machine learning has been successfully used to build models for credit scoring [20]. In credit scoring, $Y$ is a binary variable, good and bad, which may be represented by 0 and 1.

To apply mutual information to credit scoring, we first calculate the mutual information $I(X_i; Y)$ for every feature $X_i$ and then do feature selection based on the values of mutual information. We propose three ways.

5.2.1. Absolute Values Method

From Property 4, we see that mutual information is nonnegative and bounded above by $\min\{H(X), H(Y)\}$ and that $I(X; Y) = 0$ if and only if $X$ and $Y$ are independent. In this sense, high mutual information indicates a large reduction in uncertainty, while low mutual information indicates a small reduction. In particular, zero mutual information means the two random variables are independent. Hence, we may select those features whose mutual information with $Y$ is larger than some threshold based on needs.

5.2.2. Relative Values

From Property 4, we have $I(X; Y) \le H(Y)$. Note that $I(X; Y)/H(Y)$ is the relative mutual information, which measures how much information $X$ catches from $Y$. Thus, we may select those features whose relative mutual information is larger than some threshold between 0 and 1 based on needs.
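
A short sketch of this method (illustrative; it assumes the normalization by $H(Y)$ described above, and the threshold is hypothetical) is the following.

```python
# Relative mutual information I(X; Y) / H(Y) for feature screening; assumes Y is not
# constant so that H(Y) > 0.
import math
from collections import Counter

def entropy(values, base=2):
    N = len(values)
    return -sum((f / N) * math.log(f / N, base) for f in Counter(values).values())

def relative_mi(x_values, y_values, mi_value):
    """mi_value is I(X; Y) computed from the contingency table as in Section 5.1."""
    return mi_value / entropy(y_values)

# Hypothetical usage with a threshold of 0.1 (mutual_information_from_data is the
# helper sketched in Section 5.1):
# keep = relative_mi(x, y, mutual_information_from_data(x, y)) > 0.1
```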

5.2.3. Chi-Square Test for Independency

For convenience, we will use the natural logarithm in mutual information. We first state an approximation formula for the natural logarithm function. It can be proved by Taylor expansion, as in Kullback’s book [5].

Lemma 29. Let $a$ and $b$ be two positive numbers less than or equal to 1. Then
$$a \ln \frac{a}{b} \approx (a - b) + \frac{(a - b)^2}{2b}.$$
The equality holds if and only if $a = b$. Moreover, the closer $a$ is to $b$, the better the approximation is.

Now let us denote $p_{i \cdot} p_{\cdot j}$ by $\hat{p}_{ij}$. Then, applying Lemma 29, we obtain
$$I(X; Y) = \sum_{i=1}^{r} \sum_{j=1}^{c} p_{ij} \ln \frac{p_{ij}}{\hat{p}_{ij}} \approx \sum_{i=1}^{r} \sum_{j=1}^{c} \left[ (p_{ij} - \hat{p}_{ij}) + \frac{(p_{ij} - \hat{p}_{ij})^2}{2 \hat{p}_{ij}} \right] = \frac{1}{2} \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(p_{ij} - \hat{p}_{ij})^2}{\hat{p}_{ij}},$$
so that
$$2 N I(X; Y) \approx \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(f_{ij} - f_{i \cdot} f_{\cdot j}/N)^2}{f_{i \cdot} f_{\cdot j}/N}.$$
The last expression is the Chi-square statistic of the contingency table. According to [5], it follows a Chi-square distribution with $(r - 1)(c - 1)$ degrees of freedom. Hence, $2 N I(X; Y)$ approximately follows a Chi-square distribution with $(r - 1)(c - 1)$ degrees of freedom. This is the well-known Chi-square test for independence of two random variables. This allows using the Chi-square distribution to assign a significance level corresponding to the values of the mutual information $I(X; Y)$ and $N$.

The null and alternative hypotheses are as follows.

$H_0$: $X$ and $Y$ are independent (i.e., there is no relationship between them).
$H_1$: $X$ and $Y$ are dependent (i.e., there is a relationship between them).

The decision rule is to reject the null hypothesis at the level of significance $\alpha$ if the statistic $2 N I(X; Y)$ is greater than $\chi^2_{\alpha}$, the upper-tail critical value from a Chi-square distribution with $(r - 1)(c - 1)$ degrees of freedom. That is, reject $H_0$ if
$$2 N I(X; Y) > \chi^2_{\alpha}.$$
Take credit scoring, for example. In this case, $c = 2$. Assume feature $X$ has 10 different values; that is, $r = 10$. Using a level of significance of $\alpha = 0.05$, we find $\chi^2_{0.05}$ to be 16.9 from a Chi-square table with $(r - 1)(c - 1) = 9$ degrees of freedom, and we select this feature only if $2 N I(X; Y) > 16.9$.

Assume a training set has $N$ examples. We can do feature selection by the following procedure (a code sketch is given after the list).

(i) Step 1. Choose a level of significance $\alpha$, say, 0.05.
(ii) Step 2. Find $r_i$, the number of values of feature $X_i$.
(iii) Step 3. Build the contingency table for $X_i$ and $Y$.
(iv) Step 4. Calculate $I(X_i; Y)$ from the contingency table.
(v) Step 5. Find $\chi^2_{\alpha}$ with $(r_i - 1)(c - 1)$ degrees of freedom from a Chi-square table or any other source such as SAS.
(vi) Step 6. Select $X_i$ if $2 N I(X_i; Y) > \chi^2_{\alpha}$ and discard it otherwise.
(vii) Step 7. Repeat Steps 2–6 for all features.

If the number of features selected from the above procedure is smaller or larger than what you want, you may adjust the level of significance and reselect features using the procedure.
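
A sketch of this procedure (an illustrative implementation, not the author's code) is given below; it computes the statistic $2 N I(X_i; Y)$ with natural logarithms, as in Lemma 29, and obtains the critical value from scipy.stats.chi2.ppf.

```python
# Chi-square-based feature selection following Steps 1-7 above (illustrative sketch).
import math
from collections import Counter
from scipy.stats import chi2

def mi_nats(x_values, y_values):
    """I(X; Y) in nats from the contingency table (natural logarithm, as in Lemma 29)."""
    N = len(x_values)
    joint, fx, fy = Counter(zip(x_values, y_values)), Counter(x_values), Counter(y_values)
    return sum((f / N) * math.log(f * N / (fx[a] * fy[b])) for (a, b), f in joint.items())

def select_features(features, y_values, alpha=0.05):
    """features: dict mapping a feature name to its list of N values."""
    N, c = len(y_values), len(set(y_values))
    selected = []
    for name, x_values in features.items():
        r = len(set(x_values))                   # Step 2: number of values of the feature
        if r < 2:
            continue                             # a constant feature is skipped (Corollary 26)
        statistic = 2 * N * mi_nats(x_values, y_values)     # Steps 3-4
        critical = chi2.ppf(1 - alpha, (r - 1) * (c - 1))   # Step 5
        if statistic > critical:                 # Step 6
            selected.append(name)
    return selected
```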

5.3. Adjustment of Mutual Information in Feature Selection

In Section 5.2, we have proposed three ways to select features based on mutual information. It seems that the larger the mutual information $I(X; Y)$, the more dependent $Y$ is on $X$. However, Proposition 28 says that if $X$ has all distinct values, then $I(X; Y)$ will reach the maximum value $H(Y)$ and the relative mutual information $I(X; Y)/H(Y)$ will reach the maximum value 1.

Therefore, if $X$ has too many different values, one may bin or group these values first. Based on the binned values, the mutual information is calculated again. For numerical variables, we may adopt a three-step process (a sketch of the binning step follows the list).

(i) Step 1: select features by removing those with small mutual information.
(ii) Step 2: do binning for the rest of the numerical features.
(iii) Step 3: select features by mutual information.
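
A sketch of the binning step (illustrative; mutual_information_from_data refers to the helper sketched in Section 5.1) using simple equal-frequency binning is the following.

```python
# Equal-frequency binning of a numeric feature before recomputing mutual information.
def equal_frequency_bins(values, n_bins=10):
    """Replace each numeric value by the index of its equal-frequency bin."""
    order = sorted(values)
    # Bin edges at the empirical quantiles; duplicate edges simply merge neighboring bins.
    edges = [order[int(k * len(order) / n_bins)] for k in range(1, n_bins)]
    return [sum(v > e for e in edges) for v in values]

# Hypothetical usage: bin an income-like feature, then recompute I(X; Y) on the bins.
# binned_income = equal_frequency_bins(income_values, n_bins=10)
# score = mutual_information_from_data(binned_income, y)
```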

5.4. Comparison with Existing Feature Selection Methods

There are many other feature selection methods in machine learning and credit scoring. An easy way is to build a logistic model for each feature with respect to the dependent variable and then select features with $p$ values less than some specified value. However, this method does not apply to nonlinear models in machine learning.

Another easy way of feature selection is to calculate the covariance of each feature with respect to the dependent variable and then select features whose covariance values (in absolute value) are larger than some specified value. Yet, mutual information is better than the covariance method [21] in that mutual information measures the general dependence of random variables without making any assumptions about the nature of their underlying relationships.

The most popular feature selection in credit scoring is done by information value [15–19]. To calculate the information value between an independent variable and the dependent variable, a binning algorithm is used to group similar attributes into a bin. The difference between the information of good accounts and that of bad accounts in each bin is then calculated. Finally, the information value is calculated as the sum of the information differences of all bins. Features with information value larger than 0.02 are believed to have strong predictive power. However, mutual information is a better measure than information value. Information value focuses only on the linear relationships of variables, whereas mutual information can potentially offer some advantages over information value for nonlinear models such as the gradient boosting model [22]. Moreover, information value depends on binning algorithms and the bin size. Different binning algorithms and/or different bin sizes will give different information values.

6. Conclusions

In this paper, we have presented a unified definition for mutual information using random variables with different probability spaces. Our idea is to define the joint distribution of two random variables by taking the marginal probabilities into consideration. With our new definition of mutual information, different joint distributions will result in different values of mutual information. After establishing some properties of the newly defined mutual information, we proposed a method to calculate mutual information in machine learning. Finally, we applied our newly defined mutual information to credit scoring.

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The author has benefited from a brief discussion with Dr. Zhigang Zhou and Dr. Fuping Huang of Elevate about probability theory.