Regularized Feature Selection in Categorical PLS for Multicollinear Data

Mehmood, Tahir

doi:https://doi.org/10.1155/2021/5561752

Mathematical Problems in Engineering

Research Article | Open Access

Volume 2021 | Article ID 5561752 | https://doi.org/10.1155/2021/5561752

Tahir Mehmood

Academic Editor: Xiaoheng Chang

Received 21 Feb 2021

Revised 10 Apr 2021

Accepted 07 May 2021

Published 17 May 2021

Abstract

This article presents an algorithm that models categorical multicollinear data by balancing model accuracy on test data against the number of features selected in the model. Multicollinear data are generated in all scientific fields, where some variables are noise and some are influential with reference to the response variable. In mathematical and statistical modeling of public health data, both the features and the response are often categorical. These datasets are usually collinear, a setting in which partial least squares (PLS) is a potential method; however, PLS does not perform feature selection by default and is designed for quantitative features. Recently, categorical PLS (Cat-PLS) was introduced. We have implemented regularized feature selection in Cat-PLS, where filter-based feature selection is combined with categorical measures of association, namely Cramer’s V, Phi coefficient, Tschuprow’s T coefficient, Contingency Coefficient, and Yule’s Q and Yule’s Y. Monte Carlo simulation with 100 runs indicates is the better choice in terms of model performance, number of selected features, and interpretation for modeling stillbirths, which is taken as the case study. The framework can be used in related areas to explore and model similar data structures.

1. Introduction

Many sciences now produce multicollinear datasets, where the task is to establish meaningful relations for better understanding and interpretation of real-life processes [1–4]. As in other sciences, in public health the researcher is interested in selecting the influential features [2, 5, 6] that explain the variation in the response or outcome $y$, where $n$ is the sample size and $p$ is the number of features. Here, the data usually comprise correlated features of a categorical nature [7]. Logistic regression-type models are the well-known candidates for modeling a categorical response [8]. In the presence of multicollinearity, the variance of the logistic regression estimates becomes very large; hence, logistic regression does not work optimally when the features are correlated. Ridge logistic regression controls the variance of the estimates but is unable to identify the influential features [9] and is not designed for a categorical feature matrix $X$. An alternative to logistic-type regression models is partial least squares (PLS) regression, a statistical learning approach specifically designed to model correlated $X$ [10]. The PLS algorithm continues to be developed, and a recent contribution is its extension to categorical features [7], called categorical PLS (CPLS). In PLS, the loading weights play a pivotal role in model building, since they reflect the association of each feature with the response $y$. The PLS loading weights are closely related to the Pearson correlation coefficient, which CPLS replaces with Cramer’s V, Phi coefficient, Tschuprow’s T coefficient, Contingency Coefficient, and Yule’s Q and Yule’s Y correlation measures. Hence, CPLS is a potential candidate for modeling a categorical response with multicollinear $X$. Several methods have been proposed for influential feature selection in PLS and are reviewed in [11]. They can be grouped into three broad categories: filter, embedded, and wrapper methods. Filter feature selection is a two-step procedure: at the first stage, PLS is fitted, and at the second stage, a filter measure is computed. Features above a threshold are marked as influential. In embedded feature selection, the filter selection is embedded in the iterative computational loop of PLS. In wrapper feature selection, an external loop is placed around the filter selection, and in each pass of the loop, selection is carried out, the model is updated, and performance is evaluated. All three types of feature selection methods have their own advantages and disadvantages. For instance, filter methods are very fast but may have low performance. Embedded and wrapper methods are computationally more expensive but are expected to perform better than filter methods. The regularized elimination procedure for feature selection in PLS [12] is a potential candidate among the wrapper methods. It selects a model beyond the optimal one, provided that the selected model’s performance is not significantly different from that of the optimal model while the selected model has fewer features. In this article, we propose a modification of the regularized elimination procedure that operates on categorical PLS instead of standard PLS.

We have applied the proposed regularized elimination in categorical PLS to stillbirth, which is a crucial issue in developing countries. Although considerable progress has been observed in the last 25 years [13], the issue still needs considerable attention [14]. Several surveys cover the issues regarding stillbirth in Pakistan, but there is a need to determine the causes of stillbirth [15]. The most comprehensive and reliable source of data on stillbirths and related features is the Pakistan Demographic and Health Survey (PDHS), conducted by the National Institute of Population Studies (NIPS) with technical assistance and funding from USAID. Although the case study taken here is from public health, the proposed method is applicable to categorical multicollinear data in general. Possible application areas include engineering, robotics, gaming, chemometrics, and bioinformatics.

2. Methods

2.1. Data Set

The data set was obtained from the 2017–18 Pakistan Demographic and Health Survey (PDHS) (https://www.nips.org.pk/PDHS-Data-Set.htm), which was designed to provide population and health indicators at the national and regional levels. The sample design contained specific indicators for each of the five provinces (Punjab, Sindh, Khyber Pakhtunkhwa (KPK), Balochistan, and Gilgit Baltistan) of Pakistan. Following the WHO definition of late fetal death used for international comparison, the sample of stillbirths for this study was restricted to births at 28 or more weeks of gestation. Women with incomplete information were excluded from the sample, after which 752 women who experienced stillbirths and 1504 women who had live births were included in the analysis. The women in the case group were mothers of newborn babies without signs of life after at least 28 weeks of pregnancy, while the women in the control group had live births. The response variable of this study was the occurrence (labeled as 1) or nonoccurrence (labeled as 0) of stillbirth among women of childbearing age (15–49 years). Maternal features such as socioeconomic and other health features related to stillbirth were taken as explanatory features, i.e., the $X$ matrix.

2.2. Categorical Partial Least Squares (CPLS)

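Categorical Partial Least Squares (CPLS) [7] models categorical data and is an extension of standard PLS [10]. The algorithm for CPLS starts with the centered feature data $X_0$ and the centered response $y_0$. CPLS is an iterative procedure based on $a = 1, \ldots, A$ iterations called components. In each CPLS component, the loading weights, score vector, loadings, and deflated $X$ and $y$ are computed as follows.

(1) Loading weights. For feature $j$, the loading weight $w_{aj}$ can be defined through Cramer’s V [16], Phi coefficient [16], Tschuprow’s T coefficient [17], Contingency Coefficient [18], and Yule’s Q and Yule’s Y [19].

Cramer’s V:
$$w_{aj} = \sqrt{\frac{\chi^2 / n}{\min(k - 1,\, r - 1)}},$$
where $\chi^2$ is derived from Pearson’s chi-squared test, $n$ is the total number of observations, and $k$ and $r$ denote the number of categories in the response and in the respective feature.

Phi coefficient:
$$w_{aj} = \sqrt{\frac{\chi^2}{n}},$$
which is referred to as the mean square contingency coefficient.

Tschuprow’s T coefficient:
$$w_{aj} = \sqrt{\frac{\phi^2}{\sqrt{(k - 1)(r - 1)}}},$$
which is the refined form of the Phi loading weights, where $k$ and $r$ denote the number of categories in the response and in the respective feature. Here $\phi^2$ is the mean square contingency defined as
$$\phi^2 = \sum_{l=1}^{k} \sum_{m=1}^{r} \frac{(\pi_{lm} - \pi_{l+}\pi_{+m})^2}{\pi_{l+}\pi_{+m}},$$
where $\pi_{lm}$ is the proportion of the sample in cell $(l, m)$ of the contingency table and $\pi_{l+}$ and $\pi_{+m}$ are the corresponding row and column proportions.

Contingency Coefficient:
$$w_{aj} = \sqrt{\frac{\chi^2}{n + \chi^2}},$$
which measures the strength of association between categorical features.

Yule’s Q and Yule’s Y:
$$w_{aj} = \frac{\mathrm{OR} - 1}{\mathrm{OR} + 1}, \qquad w_{aj} = \frac{\sqrt{\mathrm{OR}} - 1}{\sqrt{\mathrm{OR}} + 1},$$
which determine the strength of the relationship between the feature and the response based on the odds ratio (OR).

The loading weights are then normalized, $w_a \leftarrow w_a / \lVert w_a \rVert$.

(2) Compute the score vector by $t_a = X_{a-1} w_a$.

(3) Compute the X-loading by regressing $X_{a-1}$ on the score vector, $p_a = X_{a-1}^{\top} t_a / (t_a^{\top} t_a)$. Similarly, compute the Y-loading through $q_a = y_{a-1}^{\top} t_a / (t_a^{\top} t_a)$.

(4) Deflate $X_{a-1}$ and $y_{a-1}$ by subtracting the contribution of $t_a$:
$$X_a = X_{a-1} - t_a p_a^{\top}, \qquad y_a = y_{a-1} - t_a q_a.$$

(5) If $a < A$, return to (1).

The loading weights, score vectors, and loadings computed for each component are stored in the respective matrices/vectors $W$, $T$, $P$, and $q$.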
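The association measures named in step (1) can be illustrated on a single contingency table. The following is a minimal sketch using the standard textbook definitions of Cramer’s V, Phi, Tschuprow’s T, the Contingency Coefficient, and Yule’s Q and Y; it is not the author’s R implementation. The function names `chi_square` and `association_measures` and the example counts are hypothetical, and the sketch assumes that no row or column of the table is empty.

```python
import math

def chi_square(table):
    """Return (Pearson chi-squared statistic, total count) for a table of counts."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n  # assumes no empty row/column
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

def association_measures(table):
    """Association between a response (rows) and one feature (columns)."""
    chi2, n = chi_square(table)
    k, r = len(table), len(table[0])  # number of categories in response, feature
    phi2 = chi2 / n                   # mean square contingency
    measures = {
        "cramers_v": math.sqrt(phi2 / min(k - 1, r - 1)),
        "phi": math.sqrt(phi2),
        "tschuprow_t": math.sqrt(phi2 / math.sqrt((k - 1) * (r - 1))),
        "contingency_c": math.sqrt(chi2 / (n + chi2)),
    }
    if k == 2 and r == 2:
        # Yule's Q and Y are defined for 2x2 tables through the odds ratio.
        (a, b), (c, d) = table
        odds_ratio = (a * d) / (b * c)  # assumes b and c are nonzero
        measures["yule_q"] = (odds_ratio - 1) / (odds_ratio + 1)
        measures["yule_y"] = (math.sqrt(odds_ratio) - 1) / (math.sqrt(odds_ratio) + 1)
    return measures

# Hypothetical 2x2 table: rows = response categories, columns = feature categories.
example = [[30, 10], [10, 30]]
result = association_measures(example)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

In a CPLS fit, one such value per feature would form the loading-weight vector of a component before normalization.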

2.3. Regularized Elimination in CPLS

Regularized elimination is a wrapper feature selection method. The current version is a modification and simplification of regularized elimination in PLS [12]. Here, we need to attach the filter measures to CPLS in a wrapper function. We have considered the following filter measures, which reflect the level of importance of each explanatory feature for the response:

(i) Loading weights (LW): PLS LW reflect the covariance of a feature with the response; hence, the importance of feature $j$ at CPLS component $a$ can be measured by $|w_{aj}|$. Features having $|w_{aj}| < u$ for some user-defined fixed threshold $u$ can be eliminated from the model.

(ii) Regression coefficients (RC): RC is an established and well-known measure for feature selection, defined as $\hat{\beta} = W (P^{\top} W)^{-1} q$. Features having $|\hat{\beta}_j| < u$ for some user-defined fixed threshold $u$ can be eliminated from the model.

(iii) Variable importance on projections (VIP): VIP for feature $j$ is defined according to [20] as
$$v_j = \sqrt{\, p \sum_{a=1}^{A} \left[ \left( q_a^2\, t_a^{\top} t_a \right) \left( w_{aj} / \lVert w_a \rVert \right)^2 \right] \Big/ \sum_{a=1}^{A} \left( q_a^2\, t_a^{\top} t_a \right) },$$
where $a = 1, \ldots, A$, $w_{aj}$ is the loading weight for feature $j$ at component $a$, and $t_a$, $w_a$, and $q_a$ are, respectively, the CPLS scores, loading weights, and $y$-loadings corresponding to the $a$th component. Feature $j$ can be eliminated if $v_j < u$ for some user-defined threshold $u$.

(iv) Selectivity ratio (SR): SR is based on the target projection approach [21], which is a postprojection of the predictor features onto the fitted response vector from the estimated model. For each feature $j$, $\mathrm{SR}_j$ can be computed as
$$\mathrm{SR}_j = \frac{V_{\mathrm{expl},j}}{V_{\mathrm{res},j}},$$
where $V_{\mathrm{expl},j}$ is the explained variance and $V_{\mathrm{res},j}$ is the residual variance for feature $j$ from the target projection model. Feature $j$ can be eliminated if $\mathrm{SR}_j < u$ for some user-defined threshold $u$.

(v) Significance multivariate correlation (SMC): SMC [22] is defined as the ratio of the mean square of regression to the mean square of residuals:
$$\mathrm{SMC}_j = \frac{\mathrm{MS}_{j,\mathrm{regression}}}{\mathrm{MS}_{j,\mathrm{residual}}},$$
where feature $j$ can be eliminated if $\mathrm{SMC}_j < u$ for some user-defined threshold $u$.
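As an illustration of how a filter measure is computed from a fitted model, the sketch below evaluates VIP from stored loading weights, scores, and $y$-loadings using the commonly cited VIP formula. It is a hedged example, not the author’s code: the function name `vip_scores`, the toy values of `W`, `T`, and `q`, and the threshold `u = 1.0` are all assumptions made for illustration.

```python
import math

def vip_scores(W, T, q):
    """VIP per feature.

    W: list of A loading-weight vectors, each of length p
    T: list of A score vectors, each of length n
    q: list of A y-loadings
    """
    A, p = len(W), len(W[0])
    # Response variation attributed to each component: q_a^2 * t_a' t_a
    ss = [q[a] ** 2 * sum(t * t for t in T[a]) for a in range(A)]
    total = sum(ss)
    vip = []
    for j in range(p):
        acc = 0.0
        for a in range(A):
            norm2 = sum(w * w for w in W[a])  # squared norm of w_a
            acc += ss[a] * W[a][j] ** 2 / norm2
        vip.append(math.sqrt(p * acc / total))
    return vip

# Toy two-component model with three features (hypothetical numbers).
W = [[0.6, 0.8, 0.0], [0.0, 0.6, 0.8]]
T = [[1.0, -1.0, 1.0, -1.0], [0.5, 0.5, -0.5, -0.5]]
q = [0.9, 0.3]

vip = vip_scores(W, T, q)
u = 1.0  # hypothetical threshold
eliminated = [j for j, v in enumerate(vip) if v < u]
print(vip, eliminated)
```

A useful sanity check on any VIP implementation is that the squared VIP values sum to the number of features $p$.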

Once the filter measures are defined, the elimination procedure for removing the ‘worst’ features from the CPLS model is as follows. Let $X_0 = X$, and let $\rho_j$ be any filter measure from LW, RC, VIP, SR, or SMC.

(1) For iteration $g$, run $y$ and $X_g$ through cross-validated CPLS and compute the performance $P_g$. The matrix $X_g$ has $p_g$ columns, and for the filter measure used, we get $p_g$ criterion values, which are sorted as $\rho_{(1)} \le \cdots \le \rho_{(p_g)}$.

(2) There will be $M$ criterion values below the threshold $u$, i.e., $M$ is the number of noninfluential features. Let $N = \lceil fM \rceil$ for some fraction $f \in (0, 1]$. Eliminate the features corresponding to the $N$ most extreme criterion values.

(3) If more than one feature is still left, let $X_{g+1}$ contain these features, and return to (1).
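The elimination loop can be sketched independently of the CPLS fit. In the hedged example below, `score_fn` is a hypothetical stand-in for "fit cross-validated CPLS and return one filter value per feature"; `u` plays the role of the threshold and `f` the elimination fraction. Two choices are assumptions not stated in the text: the loop stops when no criterion value falls below the threshold, and at least one feature is always kept.

```python
import math

def eliminate(features, score_fn, u, f=0.1):
    """Return the list of feature sets visited, one per elimination iteration.

    features: list of feature names
    score_fn: callable(list of names) -> dict mapping name to filter value
              (placeholder for a cross-validated CPLS fit)
    u: threshold on the filter measure
    f: fraction of below-threshold features removed per iteration
    """
    current = list(features)
    path = [list(current)]
    while len(current) > 1:
        scores = score_fn(current)
        # Features below the threshold, worst (smallest value) first.
        below = sorted((name for name in current if scores[name] < u),
                       key=lambda name: scores[name])
        M = len(below)
        if M == 0:  # assumption: nothing left to eliminate, so stop
            break
        N = min(math.ceil(f * M), len(current) - 1)  # assumption: keep >= 1 feature
        drop = set(below[:N])
        current = [name for name in current if name not in drop]
        path.append(list(current))
    return path

# Hypothetical fixed filter values for ten features f1..f10.
scores = {f"f{j}": j / 10 for j in range(1, 11)}
path = eliminate(list(scores), lambda names: {n: scores[n] for n in names},
                 u=0.55, f=0.5)
print([len(step) for step in path])
```

In a real run the filter values would be recomputed from a refitted model at every iteration, so the set of below-threshold features can change between iterations; the fixed scores here only make the control flow easy to follow.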

The fraction $f$ determines the step length of the elimination algorithm, where a small $f$ eliminates only a few features in each iteration. With each iteration, the number of features in $X_g$ decreases, but the performance may increase or decrease. Performance increases when noise features are removed and decreases when relevant features are removed. After the optimal iteration $g^*$ with performance $P^*$, the number of features keeps decreasing at the cost of only a modest drop in performance. Hence, by eliminating beyond $g^*$, one can obtain a much simpler model with a small loss of performance. To carry out this regularization, the McNemar test can be used. The predictions of the optimal model are compared with those of the models that follow it beyond iteration $g^*$. If the prediction difference is not significant, and this holds over several iterations beyond $g^*$, then the selected model is the one having the fewest features.
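The regularization step can be sketched as follows. The text does not say which variant of the McNemar test is used, so this example assumes the exact (binomial) two-sided version; the significance level `alpha=0.05`, the function names, and the toy predictions are likewise assumptions made for illustration.

```python
from math import comb

def mcnemar_exact_p(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value for two sets of predictions."""
    # b: model A right and model B wrong; c: model A wrong and model B right
    b = sum(1 for y, pa, pb in zip(y_true, pred_a, pred_b) if pa == y and pb != y)
    c = sum(1 for y, pa, pb in zip(y_true, pred_a, pred_b) if pa != y and pb == y)
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def select_model(y_true, predictions, n_features, alpha=0.05):
    """Return (optimal iteration, selected iteration).

    predictions: one list of test predictions per elimination iteration
    n_features: number of features used at each iteration
    """
    acc = [sum(p == y for p, y in zip(pred, y_true)) / len(y_true)
           for pred in predictions]
    best = max(range(len(acc)), key=lambda g: acc[g])
    selected = best
    for g in range(best + 1, len(predictions)):
        not_different = mcnemar_exact_p(y_true, predictions[best], predictions[g]) >= alpha
        if not_different and n_features[g] < n_features[selected]:
            selected = g
    return best, selected

def flip(labels, positions):
    """Copy of binary labels with the given positions misclassified."""
    return [1 - v if i in positions else v for i, v in enumerate(labels)]

# Hypothetical test labels and predictions from four elimination iterations.
y_true = [0, 1] * 10
predictions = [flip(y_true, {18, 19}), flip(y_true, {0}),
               flip(y_true, {0, 1}), flip(y_true, set(range(12)))]
n_features = [34, 30, 27, 24]
best, selected = select_model(y_true, predictions, n_features)
print(best, selected)
```

Here the second iteration is optimal, the third is not significantly worse and uses fewer features, and the fourth is rejected because its predictions differ significantly from the optimal ones.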

2.4. Model Fitting and Validation

The regularized elimination in CPLS has several parameters to tune, namely the elimination fraction $f$, the number of CPLS components $A$, and the threshold $u$ used for the filter measure. The elimination fraction $f$ affects the number of iterations in regularized elimination in CPLS; hence, it mostly affects the computational time, so we can take $f = 0.1$, which means that in each iteration we eliminate only 10% of the extreme criterion features. $A$ and $u$ can affect the model performance; hence, they need to be tuned. We have considered several values of $A$ and four distribution-based levels, i.e., quantiles, for $u$. For this, we first computed the respective filter measure for all features in the model; then, four quantiles of the filter measure were used as the different levels of $u$. For evaluation of the fitted model and parameter tuning, we have adopted a cross-validation procedure. The full data set was divided into training (70%) and test (30%) data. Using the training data, the CPLS model was fitted for all possible combinations of $A$ and $u$, and the performance on the test data was computed. We have used accuracy as the performance measure, i.e., how well CPLS predicts the response on test data. Since the split of the data into training and test sets is random, we have used Monte Carlo simulation with 100 runs to minimize the effect of this randomness, where in each run the CPLS model was fitted and evaluated as described above.
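The validation scheme can be sketched as a repeated random 70/30 split. This is a minimal illustration under stated assumptions: `fit_predict` is a hypothetical placeholder for fitting CPLS and predicting the test responses, the `majority_class` classifier is a stand-in that is NOT CPLS, the filter values are made up, and the choice of quintile cut points as the four threshold levels is an assumption, since the text does not say which quantiles were used.

```python
import random
import statistics

def monte_carlo_accuracy(X, y, fit_predict, runs=100, train_frac=0.7, seed=0):
    """Test accuracy over repeated random training/test splits."""
    rng = random.Random(seed)
    n = len(y)
    n_train = round(train_frac * n)
    accuracies = []
    for _ in range(runs):
        idx = list(range(n))
        rng.shuffle(idx)
        train, test = idx[:n_train], idx[n_train:]
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in test])
        correct = sum(p == y[i] for p, i in zip(preds, test))
        accuracies.append(correct / len(test))
    return accuracies

def majority_class(X_train, y_train, X_test):
    """Stand-in classifier (not CPLS): predict the most frequent training label."""
    label = statistics.mode(y_train)
    return [label] * len(X_test)

# Labels with the class sizes reported for the refined sample (47 cases, 94 controls);
# the feature values themselves are placeholders.
y = [1] * 47 + [0] * 94
X = [[0] for _ in y]
accs = monte_carlo_accuracy(X, y, majority_class)
print(statistics.mean(accs), statistics.stdev(accs))

# Four distribution-based threshold levels from hypothetical filter values.
filter_values = [0.1 * i for i in range(1, 35)]
thresholds = statistics.quantiles(filter_values, n=5)  # four cut points
print(thresholds)
```

The mean and standard deviation of the returned accuracies correspond to the two summaries used later to compare models, namely average accuracy and stability.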

2.5. Computations

All methods are implemented in the R computing environment (http://www.r-project.org/), and the code is available from the corresponding author upon request.

3. Results and Discussion

The data set contains a total of 2256 births with 752 stillbirths, and we observed several outliers in the data. Moreover, for modeling, we assume that the samples are independent of each other. So, in order to remove outliers and to ensure that the samples are independent, we have used k-means clustering over the stillbirth and live birth samples separately. This gave 141 independent samples, with 94 live births and 47 stillbirths, and 34 features covering maternal features, placental deficiency, fetal growth limitations, fetal growth features, and congenital features related to stillbirth, which were taken as explanatory features, i.e., the $X$ matrix. In CPLS, there are six categorical measure-based loading weights, i.e., Cramer’s V, Phi coefficient, Tschuprow’s T coefficient, Contingency Coefficient, and Yule’s Q and Yule’s Y. Each CPLS was fitted within regularized elimination, which utilizes five filter measures, namely loading weights (LW), regression coefficients (RC), variable importance on projections (VIP), selectivity ratio (SR), and significance multivariate correlation (SMC). Hence, there are $6 \times 5 = 30$ regularized elimination in CPLS models to fit and to compare. For this, 100 Monte Carlo simulations were executed. The response and the explanatory feature matrix were divided into training (70%) and test (30%) data sets in each Monte Carlo simulation. Each of the 30 regularized eliminations in the CPLS model was fitted over the training data, while the test data was used to tune the model parameters and to measure the model performance, i.e., accuracy. Hence, each model was fitted and evaluated 100 times.

Regularized elimination in CPLS selects a model beyond the optimal model whose response predictions are not significantly different from those of the optimal model. Hence, for each of the 30 regularized eliminations and each Monte Carlo run in CPLS, we have two models, named the optimal model and the selected model. In regularized elimination, we have used the McNemar test with . Since several filter measures and categorical measures are used in regularized categorical PLS, we have used the Kruskal–Wallis test to study their effect on the accuracy on test data. It appears that the accuracy on test data varies significantly with the filter measures and also varies significantly with the categorical measures . Figure 1 presents the distribution and comparison of accuracy from both models. This indicates that CPLS based on the Contingency Coefficient with filter measures , i.e., and with filter measure , i.e., have low accuracy in the optimal model and consequently in the selected model. The left-hand panel of Figure 1 is a magnified view of the upper right part of the right-hand panel. This indicates that CPLS based on Tschuprow’s T coefficient with filter measure , i.e., , Phi coefficient with filter measure , i.e., and with filter measure , i.e., perform with the best accuracy for the optimal model but have relatively low accuracy for the selected model. Yule’s Q with filter measure , i.e., and Cramer’s V with filter measure , i.e., have reasonably good performance, which is duly supported by the Wilcoxon rank sum test with continuity correction .

When choosing a model for feature selection, the stability of the model is an important aspect to consider. Figure 2 presents the standard deviations of accuracy for all fitted models. The variation is smaller for , , , and . Combining the accuracy analysis and the stability analysis from Figures 1 and 2, respectively, we can conclude that achieves good average accuracy both for the optimal model and the selected model and, at the same time, has higher stability in accuracy for the selected model, since it shows a lower standard deviation of accuracy.

When it comes to a parsimonious model, the number of features together with the accuracy of the fitted model is important. A smaller number of features in the fitted model makes the model better suited to interpretation and to understanding the real-life phenomenon. The distribution of the number of features used in the selected and optimal models is presented in Figure 3. All selected models have relatively low numbers of features compared to the optimal model, which is the expected pattern from the regularized elimination in CPLS algorithm. Since the filter measures mainly drive feature selection in CPLS, the distribution of the number of selected features across the filter measures and CPLS algorithms under consideration is important. It appears that and then choose the lowest number of features for all types of CPLS except for Yule’s Y and Yule’s Q. Moreover, for Yule’s Y and Yule’s Q, and choose the lowest number of features. For further investigation, the number of PLS components, which represents the complexity of the fitted model, is shown in Figure 4. The left panel shows the distribution of the number of components used for the evaluated filter measures, indicating that , , and have a lower level of complexity compared to and . The right panel presents the distribution of the number of components for each categorical choice of PLS (CPLS), indicating that , , and -based CPLS are relatively less complex compared to the other CPLS methods.

It appears that has a smaller number of features in the selected model as well as good and consistent accuracy; hence, it can be used as a potential candidate for modeling the occurrence or nonoccurrence of stillbirth among women of childbearing age. Influential features obtained by fitting the are presented in Table 1. The count and percentage of these features over the occurrence and nonoccurrence of stillbirth, together with their odds ratios (OR) and significance, are also presented. Since only a small subset of refined cases is considered here, the presented effects should not be used to quantify trends.

Results indicate that stillbirth is 1.51 times more likely in Baluchistan province than in Punjab province. Punjab is a more developed province than Baluchistan, and its health and related facilities are much better [23]; the same trend is reflected in the results. Similarly, stillbirth is 2.5 times more likely in rural areas than in urban areas. This is again plausible, since cities and big towns are expected to have better health facilities. Stillbirth is 0.685 times as likely if the mother’s age increases from 19 to . These results support the findings reported in [24]. With nurse and traditional birth attendance, stillbirth is 0.979 and 0.901 times as likely, respectively. With an increase in antenatal care (ANC) visits up to 3, stillbirth is 0.441 times as likely. Stillbirth is 1.089 times more likely for women having pregnancy complications. With the use of iron tablets during pregnancy, stillbirth is 0.916 times as likely. If a mother is socially dependent for medical assistance, the chances of stillbirth increase by 2.53 times. With primary, secondary, and higher levels of husband’s education, stillbirth is 0.57, 0.43, and 0.31 times as likely, respectively, compared to an illiterate husband. Working women are observed to be 0.312 times as likely to have stillbirths. Compared to pregnancy order 1–3, women with pregnancy order 4–6 and are 2.46 and 5.99 times more likely to have stillbirths. Most importantly, it is reported that education reflects socioeconomic position, and that the improved socioeconomic status generated by higher education and better working status of women leads to a healthier mother and child [25]. Notably, the proposed method is applicable to categorical multicollinear data only; if data conditions vary, the performance of the proposed method may vary.
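The effects above are reported as odds ratios. As a reminder of how such a value arises from a two-by-two table, the following minimal sketch computes an odds ratio and a Wald-type 95% confidence interval. The counts are hypothetical and are not taken from Table 1; the function names are likewise illustrative.

```python
import math

def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a: exposed cases, b: exposed controls,
    c: unexposed cases, d: unexposed controls (all assumed nonzero)
    """
    return (a * d) / (b * c)

def wald_ci(a, b, c, d, z=1.96):
    """Approximate 95% confidence interval on the odds-ratio scale."""
    log_or = math.log(odds_ratio(a, b, c, d))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)

# Hypothetical counts: 20 rural stillbirths, 30 rural live births,
# 10 urban stillbirths, 40 urban live births.
or_value = odds_ratio(20, 30, 10, 40)
low, high = wald_ci(20, 30, 10, 40)
print(f"OR = {or_value:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

An odds ratio above 1 indicates higher odds of stillbirth in the exposed group, and a value below 1 indicates lower odds; an interval that excludes 1 corresponds to a significant effect at the chosen level.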

4. Conclusion

A comprehensive comparison of filter-based feature selection and categorical PLS loading weights in the framework of regularized elimination in PLS has been conducted. Monte Carlo simulation with 100 runs indicated that is the better choice for modeling the occurrence or nonoccurrence of stillbirth in terms of model performance, number of selected features, and interpretation. The influential features that affect the occurrence of stillbirth cover maternal socioeconomic and health facilitation-related features. The proposed method is applicable to categorical multicollinear data only; if data conditions vary, the performance of the proposed method may vary. The framework can be used in related areas to explore and model health-related issues.

Data Availability

The data used to support the findings of the study are available from the corresponding author upon request.

Conflicts of Interest

The author declares no conflicts of interest.

References

[1] P. Bühlmann and S. Van De Geer, Statistics for High-Dimensional Data: Methods, Theory and Applications, Springer Science & Business Media, Berlin, Germany, 2011.
[2] T. Mehmood, M. Sadiq, and M. Aslam, “Filter-based factor selection methods in partial least squares regression,” IEEE Access, vol. 7, pp. 153499–153508, 2019.
[3] D. N. Reshef, Y. A. Reshef, H. K. Finucane et al., “Detecting novel associations in large data sets,” Science, vol. 334, no. 6062, pp. 1518–1524, 2011.
[4] T. Smitha and V. Sundaram, “Comparative study of data mining algorithms for high dimensional data analysis,” International Journal of Advances in Engineering & Technology, vol. 4, p. 173, 2012.
[5] A. Ghaffar, S. Pongponich, N. Ghaffar, and T. Mehmood, “Factors associated with utilization of antenatal care services in Balochistan province of Pakistan: an analysis of the multiple indicator cluster survey (MICS) 2010,” Pakistan Journal of Medical Sciences, vol. 31, pp. 1447–52, 2015.
[6] S. Pongpanich, A. Ghaffar, N. Ghaffar, and T. Mehmood, “Skilled birth attendance in Balochistan, Pakistan,” Asian Biomedicine, vol. 10, pp. 25–34, 2016.
[7] M. Sadiq, T. Mehmood, and M. Aslam, “Identifying the factors associated with cesarean section modeled with categorical correlation coefficients in partial least squares,” PLoS One, vol. 14, 2019.
[8] R. E. Wright, Logistic Regression, Reading and Understanding Multivariate Statistics, American Psychological Association USA, Washington, DC, USA, 1995.
[9] D. Inan and B. E. Erdogan, “Liu-type logistic estimator,” Communications in Statistics - Simulation and Computation, vol. 42, no. 7, pp. 1578–1586, 2013.
[10] H. Martens and T. Næs, “Multivariate calibration,” Chemometrics, pp. 147–156, 1984.
[11] T. Mehmood, K. H. Liland, L. Snipen, and S. Sæbø, “A review of variable selection methods in partial least squares regression,” Chemometrics and Intelligent Laboratory Systems, vol. 118, pp. 62–69, 2012.
[12] T. Mehmood, H. Martens, S. Sæbø, J. Warringer, and L. Snipen, “A partial least squares based algorithm for parsimonious variable selection,” Algorithms for Molecular Biology, vol. 6, p. 27, 2011.
[13] D. You, P. Bastian, J. Wu, and T. Wardlaw, Levels and Trends in Child Mortality: Report 2013, The World Bank, Washington, DC, USA, 2013.
[14] N. J. Kassebaum, A. Bertozzi-Villa, M. S. Coggeshall et al., “Global, regional, and national levels and causes of maternal mortality during 1990–2013: a systematic analysis for the global burden of disease study 2013,” The Lancet, vol. 384, pp. 980–1004, 2014.
[15] M. Z. Zakar, R. Zakar, M. Mustafa, A. Jalil, and F. Fischer, “Underreporting of stillbirths in Pakistan: perspectives of the parents, community and healthcare providers,” BMC Pregnancy and Childbirth, vol. 18, p. 302, 2018.
[16] H. Cramér, Mathematical Methods of Statistics (PMS-9), vol. 9, Princeton University Press, Princeton, NJ, USA, 2016.
[17] A. A. Tschuprow and M. Kantorowitsch, “Principles of the mathematical theory of correlation,” William Hodge, Cambridge, UK, 1939, Technical Report.
[18] M. Friendly and S. Institute, Visualizing Categorical Data, Sas Institute, Cary, NC, USA, 2000.
[19] G. U. Yule, “On the methods of measuring association between two attributes,” Journal of the Royal Statistical Society, vol. 75, no. 6, pp. 579–652, 1912.
[20] L. Eriksson, E. Johansson, N. Kettaneh-Wold, and S. Wold, Multi-and Megavariate Data Analysis, Umetrics Umeå, Umea, Sweden, 2001.
[21] O. M. Kvalheim and T. V. Karstang, “Interpretation of latent-variable regression models,” Chemometrics and Intelligent Laboratory Systems, vol. 7, no. 1-2, pp. 39–51, 1989.
[22] T. N. Tran, N. L. Afanador, L. M. C. Buydens, and L. Blanchet, “Interpretation of variable importance in partial least squares with significance multivariate correlation (SMC),” Chemometrics and Intelligent Laboratory Systems, vol. 138, pp. 153–160, 2014.
[23] A. Kols, Z. Gorar, M. Sharjeel et al., “Provincial differences in levels, trends, and determinants of childhood immunization in Pakistan,” Eastern Mediterranean Health Journal, vol. 24, no. 4, pp. 333–344, 2018.
[24] J. Gardosi, V. Madurasinghe, M. Williams, A. Malik, and A. Francis, “Maternal and fetal risk factors for stillbirth: population based study,” BMJ, vol. 346, no. 3, p. f108, 2013.
[25] I. K. Sørbye, C. Stoltenberg, J. Sundby, A. K. Daltveit, and S. Vangen, “Stillbirth and infant death among generations of Pakistani immigrant descent: a population-based study,” Acta Obstetricia et Gynecologica Scandinavica, vol. 93, pp. 168–174, 2014.

Copyright

Copyright © 2021 Tahir Mehmood. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
