Abstract

Testing the number of components in a finite mixture is a challenging problem. In this paper, a procedure is developed for determining the number of components in a finite exponential mixture. A sequential testing procedure is adopted based on the likelihood ratio test (LRT) statistic. The distribution of the test statistic under the null hypothesis is obtained by a resampling technique based on B bootstrap samples, from which the quantiles of the distribution are evaluated. The performance of the test is examined through its empirical power and through applications to two real datasets. The proposed procedure is used not only for testing the number of components but also for estimating the optimal number of components in a finite exponential mixture distribution. The innovation of this paper is the sequential test, which tests the more general hypothesis of a finite exponential mixture of $k$ components versus a mixture of $k+1$ components. The special case of testing an exponential mixture of one component versus two components is the one commonly used in the literature.

1. Introduction

The exponential distribution, which is analytically very simple, plays a role in reliability and life testing analogous to that of the normal distribution in other areas. Consequently, the exponential distribution has become a basic model in research on life-expectancy experiments. Applications of the exponential distribution include designing acceptance sampling plans [1], estimating reliability in multicomponent stress-strength models [2], and constructing multivariate control charts [3]. Also, neutrosophic statistics is applied when the data have uncertain parameters or values [4, 5]. One reason to study this distribution in mixtures is its lack-of-memory property.

Mixture models have been discussed by many authors, including Everitt and Hand [6]; Titterington et al. [7]; McLachlan and Basford [8]; Lindsay [9]; McLachlan and Krishnan [10]; and McLachlan and Peel [11].

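Let $x_1, x_2, \ldots, x_n$ be a random sample arising from a mixture of finite exponential distributions (MFED), whose density is

$$f(x; \Theta) = \sum_{j=1}^{k} p_j f_j(x; \theta_j), \quad (1)$$

where $k$ is the number of exponential components, $p_j \ge 0$ with $\sum_{j=1}^{k} p_j = 1$, and

$$f_j(x; \theta_j) = \frac{1}{\theta_j} e^{-x/\theta_j}, \quad x > 0,\ \theta_j > 0, \quad (2)$$

represents the density function of the $j$th component.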
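The mixture density and its log-likelihood can be evaluated directly, as in the following minimal Python sketch. It assumes that each component is parameterized by its mean $\theta_j$; the function names and the numerical values are hypothetical and chosen only for illustration.

```python
import numpy as np

def component_pdf(x, theta):
    """Exponential density with mean theta (assumed parameterization)."""
    return np.exp(-x / theta) / theta

def mixture_pdf(x, p, theta):
    """Density of a k-component exponential mixture at each point of x."""
    x = np.asarray(x, dtype=float)[:, None]     # shape (n, 1)
    p = np.asarray(p, dtype=float)              # mixing proportions, shape (k,)
    theta = np.asarray(theta, dtype=float)      # component means, shape (k,)
    return (p * component_pdf(x, theta)).sum(axis=1)

def log_likelihood(x, p, theta):
    """Log-likelihood of the sample x under the mixture."""
    return float(np.log(mixture_pdf(x, p, theta)).sum())

# Illustrative two-component mixture (values are not taken from the tables)
x = np.array([0.1, 0.4, 2.0, 7.5])
print(mixture_pdf(x, p=[0.9, 0.1], theta=[0.35, 5.0]))
print(log_likelihood(x, p=[0.9, 0.1], theta=[0.35, 5.0]))
```

With a single component and proportion 1, `log_likelihood` reduces to the ordinary exponential log-likelihood.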

Inference on the number of components can be conducted through statistical tests such as likelihood ratio tests. Several papers have dealt with bootstrapping the LRT. McLachlan [12] and Feng and McCulloch [13] used bootstrap resampling to test a single normal distribution against a mixture of two normal distributions; Feng and McCulloch [13] also noted that bootstrap resampling is the preferred method for normal mixtures with different variances. Seidel, Mosler, and Alker [14, 15] considered a mixture of two exponential distributions, and Sultan, Ismail, and Al-Moisheer [16] considered a mixture of two inverse Weibull distributions. The discrete Poisson distribution was used by Karlis and Xekalaki [17]. Some criteria are also used to choose the number of components in finite mixture models; see, for example, McLachlan and Peel [11]. Various authors have suggested the simplest form of LRT testing, namely a single component against a two-component model. Here, the test procedure is proposed for $k$ components against $k+1$ components.

This paper is arranged into six sections. In Section 2, an algorithm is presented for determining the number of finite exponential components by bootstrapping the LRT using R software packages. Section 3 contains the simulation results from computing the quantiles for an estimated number of finite exponential components using a sequential test. In Section 4, we evaluate the power of the sequential test when determining the number of finite exponential components. In Section 5, the sequential test procedure is applied to two real datasets, together with likelihood-based criteria for determining (estimating) the number of components in finite mixture models: the Akaike information criterion (AIC), the Bayesian information criterion (BIC), the Hannan–Quinn information criterion (HQIC), and the consistent Akaike information criterion (CAIC). Finally, Section 6 presents the conclusions on sequential testing of the number of finite exponential components.

2. Determining the Number of Components in the Finite Exponential Mixture

In this section, we use a sequential test to specify the number of components in the finite exponential mixture by means of a resampling procedure, the bootstrap. McLachlan [12] and Feng and McCulloch [13] discussed the idea of bootstrapping the LRT. The general method for determining the number of components is based on the LRT, whose statistic is an appropriate test statistic for the hypotheses considered here. The test statistic is defined as $-2\log\lambda$, where $\lambda$ represents the ratio between the maximized likelihood functions under the null hypothesis $H_0$ and the alternative hypothesis $H_1$, respectively, $\lambda = L(\hat{\Theta}_0)/L(\hat{\Theta}_1)$. Equivalently, the test statistic can be written as $-2\log\lambda = 2[\log L(\hat{\Theta}_1) - \log L(\hat{\Theta}_0)]$, where $\hat{\Theta}_i$ is the maximum likelihood estimator (MLE) of the parameter $\Theta$ under $H_i$, $i = 0, 1$.
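As a worked instance of the statistic, consider the simplest null hypothesis of a single exponential component against a two-component mixture. The derivation below assumes the mean parameterization $f(x; \theta) = \theta^{-1} e^{-x/\theta}$, which is consistent with the sample mean being the MLE in the one-component case; the symbols $\ell_0$ and $\ell_1$ for the log-likelihoods under the null and the alternative are introduced here only for illustration.

```latex
\begin{align*}
\ell_0(\theta) &= \sum_{i=1}^{n} \log\!\left(\frac{1}{\theta}\, e^{-x_i/\theta}\right)
                = -n\log\theta - \frac{1}{\theta}\sum_{i=1}^{n} x_i, \\
\hat{\theta} &= \bar{x}, \qquad
\ell_0(\hat{\theta}) = -n\left(\log\bar{x} + 1\right), \\
-2\log\lambda &= 2\left[\ell_1(\hat{\Theta}_1) + n\left(\log\bar{x} + 1\right)\right].
\end{align*}
```

Under this assumption, only the maximized log-likelihood $\ell_1(\hat{\Theta}_1)$ of the two-component mixture requires numerical optimization; the null term is available in closed form.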

Consider the hypothesis $H_0$: the number of components in the exponential finite mixture is $k$ against $H_1$: the number of components in the exponential finite mixture is $k+1$. The testing procedure is sequential for $k = 1, 2, \ldots$ and uses the LRT statistic. Bootstrapping the LRT is as follows.

Set $k = 1$.
(1) Find the MLEs of the parameters $p_j$, $\theta_j$ of the finite exponential mixture for $k$ and $k+1$ components and calculate the LRT statistic, which is referred to as $-2\log\lambda$. For the case $k = 1$, the MLE of $\theta$ is $\hat{\theta} = \bar{x}$, the sample mean.
(2) Generate a bootstrap sample of size $n$ ($n$ is the sample size) from the exponential mixture of $k$ components and calculate the value of $-2\log\lambda$ after obtaining the MLEs of the parameters under $H_0$ and $H_1$. The EM algorithm for a finite mixture of exponential distributions is used, as described in [15]; its updates are expressed in terms of the densities $f$ and $f_j$ given in equations (1) and (2), respectively.
(3) The process is repeated independently a number of times $B$.
(4) The $j$th order statistic of the $B$ replications of $-2\log\lambda$ can be taken as an estimate of the quantile of order $j/(B+1)$.
(5) The bootstrap replications can also be used to provide a test of approximate size $\alpha$, where $\alpha = 1 - j/(B+1)$.
(6) If $-2\log\lambda$ for the original sample exceeds the $j$th order statistic of the $B$ replications of $-2\log\lambda$, then the null hypothesis $H_0$ is rejected and we set $k = k+1$ and go to step 1. Otherwise, the optimal number of components is $k$ and the test is terminated. For more details, see McLachlan and Peel [11].
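The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the R implementation used in the paper: components are parameterized by their means, the EM updates are the standard ones for an exponential mixture, the starting values (equal weights, means at sample quantiles), the function names, and the cap `k_max` are all hypothetical choices.

```python
import numpy as np

def mixture_loglik(x, p, theta):
    dens = p * np.exp(-x[:, None] / theta) / theta       # p_j f_j(x_i), shape (n, k)
    return float(np.log(dens.sum(axis=1)).sum())

def fit_em(x, k, n_iter=300, tol=1e-8):
    """Return (p, theta, loglik) for a k-component exponential mixture with means theta."""
    x = np.asarray(x, dtype=float)
    if k == 1:                                           # closed form: MLE is the sample mean
        return np.ones(1), np.array([x.mean()]), float(-len(x) * (np.log(x.mean()) + 1.0))
    p = np.full(k, 1.0 / k)                              # assumed start: equal weights
    theta = np.clip(np.quantile(x, (np.arange(k) + 0.5) / k), 1e-8, None)
    prev = -np.inf
    for _ in range(n_iter):
        dens = p * np.exp(-x[:, None] / theta) / theta
        mix = dens.sum(axis=1)
        loglik = np.log(mix).sum()
        if loglik - prev < tol:                          # stop when the gain is negligible
            break
        prev = loglik
        tau = dens / mix[:, None]                        # E-step: membership probabilities
        p = tau.mean(axis=0)                             # M-step: mixing proportions
        theta = (tau * x[:, None]).sum(axis=0) / np.maximum(tau.sum(axis=0), 1e-300)
        theta = np.clip(theta, 1e-8, None)               # guard against degenerate means
    return p, theta, mixture_loglik(x, p, theta)

def lrt(x, k):
    """-2 log(lambda) for H0: k components against H1: k + 1 components."""
    return max(0.0, 2.0 * (fit_em(x, k + 1)[2] - fit_em(x, k)[2]))

def sample_mixture(rng, n, p, theta):
    labels = rng.choice(len(p), size=n, p=p / p.sum())   # latent component labels
    return rng.exponential(scale=theta[labels])

def sequential_test(x, alpha=0.05, B=99, k_max=4, seed=0):
    """Increase k until H0 is not rejected; return the selected number of components."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    j = int(round((1 - alpha) * (B + 1)))                # test size is about 1 - j/(B+1)
    for k in range(1, k_max + 1):
        p0, theta0, _ = fit_em(x, k)                     # fitted null model
        observed = lrt(x, k)
        reps = np.sort([lrt(sample_mixture(rng, len(x), p0, theta0), k) for _ in range(B)])
        if observed <= reps[j - 1]:                      # H0 not rejected: stop at k
            return k
    return k_max                                         # hypothetical cap, not part of the procedure

x = np.random.default_rng(1).exponential(scale=2.0, size=60)
print(sequential_test(x))
```

The statistic is truncated at zero because EM may stop at a local maximum under the alternative; a more careful implementation would use several starting values.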

3. Simulation Results

To obtain the distribution of the LRT under the null hypothesis $H_0$, we use simulated data for the sequential test from a mixture of univariate exponential distributions. Accordingly, we use a package for R that provides a set of functions for analyzing a variety of finite mixture models. The package is used to generate a random sample from a mixture of univariate exponential distributions. We then require the MLE of the mixing distribution, for which the package's fitting function can be used to find the best fit for the model.
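The sampling step can be illustrated with a short NumPy sketch; it is a stand-in for the idea, not the R code used for the simulations. The function name `sample_exp_mixture` is hypothetical, and the example parameters assume that a vector such as (0.8, 0.15, 0.5, 2, 4) lists two mixing proportions followed by three component means, with the third proportion obtained by subtraction.

```python
import numpy as np

def sample_exp_mixture(n, p, theta, seed=None):
    """Draw n observations from an exponential mixture with proportions p and means theta."""
    rng = np.random.default_rng(seed)
    p = np.asarray(p, dtype=float)
    theta = np.asarray(theta, dtype=float)
    labels = rng.choice(len(p), size=n, p=p / p.sum())   # latent component of each draw
    return rng.exponential(scale=theta[labels])          # draw from the chosen component

# Assumed reading of the parameter vector: p = (0.8, 0.15, 0.05), means = (0.5, 2, 4)
x = sample_exp_mixture(100, p=[0.8, 0.15, 0.05], theta=[0.5, 2.0, 4.0], seed=1)
print(x.mean())
```

The population mean of this mixture is $\sum_j p_j \theta_j$, which gives a quick sanity check on large simulated samples.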

The sequential test is applied to choose the number of components $k$. To calculate the quantiles of the LRT, we simulate the null distribution of $-2\log\lambda$ for the sample sizes at , and 100 according to the stopping criteria described in [14] by using as the level of accuracy. Each distribution is generated for the parameters in the model with 500 bootstrap samples. Tables 1–3 present the quantiles at the significance levels , , and . The test always rejected $H_0$ for each and the number of components at sample size , and the test was repeated five times, as shown in Tables 2 and 3. A representative example from the simulation is at , sample sizes are , and the choice of parameters is (0.8, 0.15, 0.5, 2, 4). Then, the best estimates for the optimal values for are , respectively. Further, the simulation results depend on the following factors.

As shown in Tables 1–3, the values of the quantiles increase with the sample size; for example, in Table 1, at in the choice of parameters (0.9, 0.35, 5) for sample size , we get the result of for the five repetitions and once at the significance level of , while for , we get the results for the significance levels of . The same goes for and (see Figure 1; see also , which shows how the sample size affects the value of the quantiles and the level of acceptance for ). With respect to the initial values of the parameters, as the initial vector consists of , the simulation results reveal that when there is a large difference between the initial values of the parameters , the number of rejections decreases and quantile results are obtained for high values of . This is clear at with the parameters , where the maximum number of accepted quantiles is in , while for the parameter choices , the accepted values of the LRT at the levels of significance are and 0.5 for sample size . For the mixing proportion, when the initial value of or the sum of is closer to 1, the number of rejections decreases. This is clear when we compare the results of the parameters at . Then, the accepted values of the level of significance are up to for a large sample size . On the other hand, for the mixing proportions , the accepted values of are up to 0.5 in the same sample. Finally, regarding the increasing number of components and the increasing levels of significance ($\alpha$), it is evident from the tables for the LRT that the maximum level of significance ($\alpha$) at is 0.9 and at it is 0.5, whereas at , it is 0.1.

4. The Power of the Sequential Test

The empirical power of the sequential test for $k$ components against $k+1$ is evaluated for sample sizes , and 100 and . The empirical power is defined as the proportion of times $H_0$ was rejected when the data were generated under $H_1$. The power is simulated using 500 bootstrap samples and different choices of parameters. Also, the power is calculated at the significance levels , and .
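The definition of empirical power can be made concrete with a small Python sketch for the simplest case, one component against two. It makes a simplifying assumption that differs from the full procedure: because the one-versus-two statistic is invariant to rescaling the data, the critical value is simulated once from the unit exponential instead of re-bootstrapping for every generated sample. Function names, EM starting values, and the example parameters are hypothetical.

```python
import numpy as np

def lrt_one_vs_two(x, n_iter=300, tol=1e-8):
    """-2 log(lambda) for one exponential against a two-component mixture, via EM."""
    x = np.asarray(x, dtype=float)
    ll0 = -len(x) * (np.log(x.mean()) + 1.0)             # maximized one-component log-likelihood
    p = np.array([0.5, 0.5])                             # assumed EM starting values
    theta = np.clip(np.quantile(x, [0.25, 0.75]), 1e-8, None)
    prev = -np.inf
    for _ in range(n_iter):
        dens = p * np.exp(-x[:, None] / theta) / theta
        mix = dens.sum(axis=1)
        ll1 = np.log(mix).sum()                          # log-likelihood at the current iterate
        if ll1 - prev < tol:
            break
        prev = ll1
        tau = dens / mix[:, None]                        # E-step
        p = tau.mean(axis=0)                             # M-step: proportions
        theta = (tau * x[:, None]).sum(axis=0) / np.maximum(tau.sum(axis=0), 1e-300)
        theta = np.clip(theta, 1e-8, None)               # M-step: means, kept positive
    return max(0.0, 2.0 * (float(ll1) - ll0))

def empirical_power(n, p, theta, alpha=0.05, B=99, reps=100, seed=0):
    """Proportion of samples from the two-component alternative for which H0 is rejected."""
    rng = np.random.default_rng(seed)
    null = np.sort([lrt_one_vs_two(rng.exponential(size=n)) for _ in range(B)])
    crit = null[int(round((1 - alpha) * (B + 1))) - 1]   # order statistic used as critical value
    theta = np.asarray(theta, dtype=float)
    rejections = 0
    for _ in range(reps):
        labels = rng.choice(2, size=n, p=p)              # data generated under H1
        rejections += int(lrt_one_vs_two(rng.exponential(scale=theta[labels])) > crit)
    return rejections / reps

print(empirical_power(n=50, p=[0.5, 0.5], theta=[0.5, 4.0]))
```

Increasing `n` or the separation between the two means in this sketch is a direct way to explore how sample size and parameter distance affect the power.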

For each case, we study the effect on the power of the test of increasing the distance between the parameters. We also study the effect of increasing the mixing proportion and the sample size. The power results are shown in Tables 4–6. In each case, the power improves for every number of components as the sample size increases. To test the components against , the power increases for large sample sizes, starting from and above. For the components against for small sample sizes of , the power always decreases. The empirical power is affected by the sample size, and not the reverse (see Figures 2–4).

5. Application

The sequential test procedure is applied to two real datasets as follows.

5.1. Application (1)

The data considered in this application are given by Maswadah [18]. They represent the maximum flood levels (in millions of cubic feet per second) of the Susquehanna River in Harrisburg, Pennsylvania, over 20 four-year periods (1890–1969). The observations are as follows: 0.654, 0.613, 0.315, 0.449, 0.297, 0.402, 0.379, 0.423, 0.379, 0.324, 0.269, 0.740, 0.418, 0.412, 0.494, 0.416, 0.338, 0.392, 0.484, 0.265.

5.2. Application (2)

This is an application of the sequential test to fitting exponential mixtures. According to Smith and Naylor [19], the following data represent the strength of 1.5 cm glass fibers.

Data ($n = 63$): 0.55, 0.93, 1.25, 1.36, 1.49, 1.52, 1.58, 1.61, 1.64, 1.68, 1.73, 1.81, 2.00, 0.74, 1.04, 1.27, 1.39, 1.49, 1.53, 1.59, 1.61, 1.66, 1.68, 1.76, 1.82, 2.01, 0.77, 1.11, 1.28, 1.42, 1.50, 1.54, 1.60, 1.62, 1.66, 1.69, 1.76, 1.84, 2.24, 0.81, 1.13, 1.29, 1.48, 1.50, 1.55, 1.61, 1.62, 1.66, 1.70, 1.77, 1.84, 0.84, 1.24, 1.48, 1.30, 1.51, 1.55, 1.61, 1.63, 1.67, 1.70, 1.78, 1.89.

We apply the sequential testing of the number of components in an exponential finite mixture to the above two real datasets, with sample sizes $n = 20$ and 63, respectively. The sequential results for the two applications are given in Tables 7 and 8. Column 1 in Tables 7 and 8 contains the number of components in the finite mixture of exponentials. Column 2 contains the values of the LRT statistic for testing $H_0$ versus $H_1$. The p values of the test, which lie between 0 and 1, are obtained by simulating the LRT under $H_0$ with 500 bootstrap samples, as described previously. The last four columns contain the information criteria used to choose the number of components in finite mixture models, namely AIC, BIC, HQIC, and CAIC. These criteria agree with the LRT results and the p values. Our results in Tables 7 and 8 lead to the selection of the mixture model with one component. It can also be seen from Tables 7 and 8 that the LRT increases when the number of components decreases, while the information criteria increase as the number of components increases. In brief, the best mixture model is the one with $k = 1$, because it has the largest p value and the minimum values of the four criteria (Algorithm 1).
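The four criteria can be computed from a maximized log-likelihood as in the following sketch. The formulas are the standard textbook definitions and are assumed here, since the exact forms used for Tables 7 and 8 are not stated; the count of $2k - 1$ free parameters (k means and k - 1 mixing proportions) and the function name are likewise assumptions.

```python
import math

def information_criteria(loglik, k, n):
    """AIC, BIC, HQIC and CAIC for a k-component exponential mixture fitted to n points."""
    d = 2 * k - 1                                   # assumed number of free parameters
    return {
        "AIC": -2 * loglik + 2 * d,
        "BIC": -2 * loglik + d * math.log(n),
        "HQIC": -2 * loglik + 2 * d * math.log(math.log(n)),
        "CAIC": -2 * loglik + d * (math.log(n) + 1),
    }

# One-component fit to the flood data of Application (1): the MLE of the mean is
# the sample mean, so the maximized log-likelihood is -n * (log(mean) + 1).
flood = [0.654, 0.613, 0.315, 0.449, 0.297, 0.402, 0.379, 0.423, 0.379, 0.324,
         0.269, 0.740, 0.418, 0.412, 0.494, 0.416, 0.338, 0.392, 0.484, 0.265]
n = len(flood)
loglik_k1 = -n * (math.log(sum(flood) / n) + 1.0)
print(information_criteria(loglik_k1, k=1, n=n))
```

For larger $k$ the log-likelihood would come from an EM fit, and the model with the smallest criterion value would be preferred.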

6. Conclusion

In this paper, sequential testing of the number of components in an exponential finite mixture is discussed. At the same time, the optimal number of components for a finite exponential mixture is determined so as to provide an appropriate fit to the data. A resampling approach based on B bootstrap samples is used to determine the number of components. Bootstrap samples are generated from the finite exponential mixture under $H_0$, and the value of $-2\log\lambda$ is evaluated for each bootstrap sample. This process is repeated for 500 bootstrap samples to obtain the order statistic that serves as an estimate of the quantile. The power results for the estimated number of finite exponential components are computed. Two applications to real data are used to illustrate the sequential test. The innovation in this sequential test is that it permits testing the hypothesis of $k$ components in an exponential mixture against $k+1$ components, along with determining the optimal exponential mixture. It thus provides a more general method than the one commonly used for exponential mixtures, which focuses on testing one component versus a mixture of two components. The importance of this sequential test lies in cluster analysis and other applications. Finally, the sequential test, which was applied here to finite exponential mixtures, can in principle be applied to finite mixture models from any family of distributions.

Algorithm 1
Put $k = 1$.
Step 1: find the MLEs of the parameters of the finite exponential components for $k$ and $k+1$, respectively, by using the code in R.
For $k = 1$: the MLE is the sample mean, $\hat{\theta} = \bar{x}$.
Otherwise: the MLEs are obtained by the EM algorithm,
and calculate the LRT statistic, say $-2\log\lambda$.
Step 2: declare the initial values of the variables , .
Step 3: generate a random sample of size $n$ from the mixture of univariate exponential distributions.
Step 4: declare the function LRT to calculate , , and then calculate the LRT by using the EM algorithm equations for finite exponential components, as mentioned in [15].
Step 5: simulate 500 bootstrap samples of size $n$ with initial vector parameters , , and for each bootstrap sample, calculate , for the LRT using the boot package functions in R.
Step 6: find the control variate estimates from a bootstrap output object. Estimate the quantiles using the linear approximation as a control variate.
Step 7: estimate the p values from the results.

Data Availability

The data used to support the findings of this study were obtained from Maswadah [18] and Smith and Naylor [19].

Conflicts of Interest

The author declares that there are no conflicts of interest.