Abstract

In 2012, the Korea Institute of Oriental Medicine proposed an objective and comprehensive physical diagnostic model to address quantification problems in the existing Sasang constitutional diagnostic method. However, certain issues have been raised regarding a revision of the proposed diagnostic model. In this paper, we propose various methodological approaches to address the problems of the previous diagnostic model. Firstly, more useful variables are selected in each component. Secondly, the least absolute shrinkage and selection operator is used to reduce multicollinearity without the modification of explanatory variables. Thirdly, proportions of SC types and age are considered to construct individual diagnostic models and classify the training set and the test set for reflecting the characteristics of the entire dataset. Finally, an integrated model is constructed with explanatory variables of individual diagnosis models. The proposed integrated diagnostic model significantly improves the sensitivities for both the male SY type (36.4% → 62.0%) and the female SE type (43.7% → 64.5%), which were areas of limitation of the previous integrated diagnostic model. The ideas of these new algorithms are expected to contribute not only to the scientific development of Sasang constitutional medicine in Korea but also to that of other diagnostic methods for traditional medicine.

1. Introduction

In recent years, the interest in and use of constitutional health care services have been reported to increase consistently. According to a survey by the Korea Institute of Oriental Medicine (KIOM), the percentage of constitutional health services in total traditional medical services reached 30.7% in 2015, which is approximately 7 percentage points higher than the corresponding value of 23.8% in 2004 [1].

Sasang constitutional medicine is a personalized medicine that diagnoses the patient’s constitution as one of the four Sasang constitutional (SC) types (Tae-Yang: TY, Tae-Eum: TE, So-Yang: SY, and So-Eum: SE) and treats him/her differently depending on the constitution. Therefore, a precise diagnosis of constitution is central to producing consistent results when various experts diagnose the constitution of a single person.

From 2007 to 2011, KIOM collected face, body, voice, and questionnaire data from 2,773 subjects who were diagnosed in terms of constitutional prescriptions by Sasang constitutional experts in 23 oriental medical clinics. Based on these data, an objective and comprehensive Sasang constitutional diagnostic model, which has addressed the existing problems, was proposed in 2012 [2].

Based on this model, many studies on the relationship between Sasang constitution and disease have been conducted. For example, the SY and TE types are independent risk factors for nonalcoholic fatty liver disease [3] regardless of obesity level, TE is a strong risk factor for type 2 diabetes [4], and metabolic syndrome increases the risk of cardiovascular disease in certain physical conditions, such as the SY type [5]. Additionally, physiological characteristics may differ across Sasang types: the SY type has a higher value of total nasal resistance compared to the TE and SE types [6], and the TE type may tolerate psychological or oxidative stress better than the other types [7]. Other studies on constitution-specific physiology, such as comparisons of gut microbiota among Sasang constitutional types [8], have also been conducted using a constitutional diagnostic model.

As increasingly more researchers use the diagnostic model developed by KIOM, multiple issues have also been raised.

Firstly, the ratio of predicted SC types should be the same as the proportions of actual SC types. In the existing model, the predicted proportion of the TE type is much higher than the actual one, which, we believe, is due to the fact that the proportion of TE type is the highest.

Secondly, the multicollinearity problem results from the fact that the explanatory variables of the diagnostic model are highly correlated, making the regression coefficient estimates unreliable.

Finally, in the existing diagnostic model, it was assumed that all individual diagnostic components equally contribute to the diagnosis of the SC types, which turns out to be somewhat unreasonable.

Therefore, in this paper, we propose a novel modified model to solve the problems of the previous diagnostic model, and we compare the revised model with the previous model in terms of various aspects.

2. Methods

2.1. Participants and Data Acquisition

Using the same standard operating procedure as in the previous study, we collected more face, body shape, voice, and questionnaire data than in the previous study. A total of 3,849 patients, ranging from teenagers to people in their eighties, were recruited from 23 sites (oriental medical clinics) between November 2006 and August 2012. This process was approved by the KIOM Institutional Review Board (I-0910/02-001), and we obtained written informed consent from the subjects. As in the previous study [2], several patients were excluded for a variety of reasons, such as a small number of TY-type subjects, subjects below the age of 15 undergoing growth spurts, or improper data (Table 1). Table 2 shows the distribution of subjects by age group.

In comparison with the previous study, the data for one year (August 2011 to August 2012) have been added, and the data extraction algorithm has been enhanced to increase the usage.

2.2. Candidate Feature Variables
2.2.1. Facial Images

A total of 57 facial feature points, including 13 newly added points over the previous study, were extracted using an automatic feature extraction algorithm (Figure 1). However, the feature points of the upper eyelid line, which require a high-resolution image, were excluded for more efficient implementation. Facial candidate feature variables using the extracted points are described in Table 3.

2.2.2. Body Shape

In contrast to the previous study, we excluded both BMI and body weight and reconstructed the model with the remaining variables. The reason for this exclusion is that there is a tendency to diagnose a subject as the TE type if BMI and body weight are high and as the SE type if they are low. Moreover, the literature notes that the body shape of the TE type is tall and large rather than obese, and the body shape of the SE type is short and small rather than slender [9].

2.2.3. Voice

Vocal features were extracted using a C++ program combined with the hidden Markov model toolkit [10]. Input vocal signals were divided into multiple windows corresponding to the reference time duration for feature extraction. The window size was 46.4 ms, which corresponds to $2^{11}$ (2,048) samples at a 44.1 kHz sampling frequency.
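The windowing arithmetic can be checked with a short sketch. It assumes a 2,048-sample window (which gives 46.4 ms at 44.1 kHz) and, purely for illustration, non-overlapping windows; the hop size and the helper names `window_duration_ms` and `split_into_windows` are assumptions, not taken from the original extraction program.

```python
SAMPLE_RATE_HZ = 44_100
WINDOW_SAMPLES = 2 ** 11  # 2,048 samples per window


def window_duration_ms(n_samples, sample_rate_hz):
    """Duration of a window of n_samples, in milliseconds."""
    return 1000.0 * n_samples / sample_rate_hz


def split_into_windows(signal, window_samples, hop_samples=None):
    """Cut a signal into full-length windows.

    hop_samples defaults to window_samples (no overlap); this default is an
    assumption, since the overlap used in the study is not stated.
    """
    hop = hop_samples or window_samples
    last_start = len(signal) - window_samples
    return [signal[s:s + window_samples] for s in range(0, last_start + 1, hop)]


duration = window_duration_ms(WINDOW_SAMPLES, SAMPLE_RATE_HZ)
# One second of silence stands in for a recorded voice signal.
windows = split_into_windows([0.0] * SAMPLE_RATE_HZ, WINDOW_SAMPLES)
```

With these settings, a one-second signal yields 21 complete windows, and the trailing partial window is discarded.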

The voice signal used was a recording of five vowels and one sentence repeated twice. Unlike the previous study [2], feature extraction was performed only from the sentence, excluding the vowels. Generally, the sentence was more suitable than the vowels because of the unchanged voice information for characterizing an individual. The accuracy and repeatability of the Sasang constitution diagnosis were better in the case of using only the sentence compared to the case of using both the vowels and the sentence [11].

A description of vocal features is provided in Table 4. The harmonic-to-noise ratio (HNR) and the cepstral peak prominence (CPP) are newly added features not found in our previous study [2].

The HNR is a measure that quantifies the amount of additive noise in the voice signal. It is widely used to characterize healthy and disordered voices. The CPP is known to be an accurate predictor and a more reliable measure of dysphonia than other vocal features, such as jitter, shimmer, and HNR [12].

2.2.4. Questionnaire

Binary variables were constructed using the response categories of the questions in the questionnaire, which consisted of multiple-choice questions in Supplementary Table S1 in Supplementary Material available online at https://doi.org/10.1155/2017/9180159. The procedure of generating questionnaire continuous variables is summarized in Supplementary Figure S1.

2.2.5. Compensating for Age Differences

As in the previous study [2], because the candidate feature variables may show age-specific trends, the effect of age was removed by normalizing each candidate feature variable with the moving average and standard deviation of that variable at the given age.
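A minimal sketch of this kind of age normalization follows. It assumes the moving statistics are computed over subjects whose age lies within a fixed number of years of the subject's own age; the window half-width of 5 years and the function name `age_normalize` are illustrative assumptions, not values from the study.

```python
from statistics import mean, stdev


def age_normalize(ages, values, half_width=5):
    """Z-score each value against subjects of similar age.

    For each subject, the mean and standard deviation are taken over all
    subjects whose age is within +/- half_width years (assumed window).
    """
    normalized = []
    for age, value in zip(ages, values):
        neighbors = [v for a, v in zip(ages, values) if abs(a - age) <= half_width]
        if len(neighbors) < 2:
            normalized.append(0.0)  # not enough data to estimate a spread
            continue
        mu, sd = mean(neighbors), stdev(neighbors)
        normalized.append((value - mu) / sd if sd > 0 else 0.0)
    return normalized


ages = [20, 21, 22, 60, 61, 62]
values = [1, 2, 3, 10, 20, 30]
z_scores = age_normalize(ages, values)
```

In this toy input the older group has much larger raw values, yet after normalization both age groups map to the same scores, which is the intended effect of removing the age trend.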

2.3. Model for Sasang Constitutional Diagnosis
2.3.1. Individual Diagnostic Models

The regression coefficients of the explanatory variables used in the previous individual models were estimated using ordinary least squares (OLS). However, the OLS estimator may have a large variance, or may not be estimable at all, if the dimension of the explanatory variables is too high or the variables are highly intercorrelated. These problems are referred to as overfitting and multicollinearity, respectively. One well-known solution to these problems is the least absolute shrinkage and selection operator (LASSO), which shrinks some coefficients, thereby reducing their variance, and sets others exactly to zero [13]. Coefficients were estimated using the glmnet package implemented in the R software. The tuning parameter $\lambda$ was selected by 10-fold cross-validation using the mean square error (MSE) as the loss. The decision rule is to choose the $\lambda$ that gives the minimum mean cross-validated error.
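The selection rule described above can be illustrated with a small, self-contained sketch. It is not the glmnet implementation: it uses a plain coordinate-descent LASSO on centered data, a hypothetical grid of four tuning-parameter values (written `lam`), and synthetic data in which two predictors are nearly collinear. All function names and data are assumptions made for illustration.

```python
import random


def soft_threshold(z, lam):
    """Soft-thresholding operator used in the LASSO coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_fit(X, y, lam, n_iter=100):
    """Minimize (1/2n)*||y - b0 - X b||^2 + lam*||b||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    cols = [[row[j] - x_mean[j] for row in X] for j in range(p)]  # centered columns
    col_sq = [sum(c * c for c in col) / n for col in cols]
    beta = [0.0] * p
    residual = [yi - y_mean for yi in y]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes feature j.
            rho = sum(c * r for c, r in zip(cols[j], residual)) / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                residual = [r - c * delta for c, r in zip(cols[j], residual)]
                beta[j] = new_bj
    intercept = y_mean - sum(m * b for m, b in zip(x_mean, beta))
    return intercept, beta


def predict(X, intercept, beta):
    return [intercept + sum(x * b for x, b in zip(row, beta)) for row in X]


def cv_mse(X, y, lam, n_folds=10, seed=0):
    """Mean cross-validated squared error for one value of lam."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    fold_errors = []
    for f in range(n_folds):
        held_out = idx[f::n_folds]
        held = set(held_out)
        train = [i for i in idx if i not in held]
        b0, beta = lasso_fit([X[i] for i in train], [y[i] for i in train], lam)
        preds = predict([X[i] for i in held_out], b0, beta)
        fold_errors.append(
            sum((y[i] - p) ** 2 for i, p in zip(held_out, preds)) / len(held_out)
        )
    return sum(fold_errors) / n_folds


def select_lambda(X, y, lambdas, n_folds=10):
    """Pick the lam with the minimum mean cross-validated error."""
    return min(lambdas, key=lambda lam: cv_mse(X, y, lam, n_folds))


# Synthetic data: x2 is nearly collinear with x1, x3 is irrelevant.
rng = random.Random(1)
X, y = [], []
for _ in range(60):
    x1 = rng.gauss(0, 1)
    X.append([x1, x1 + rng.gauss(0, 0.05), rng.gauss(0, 1)])
    y.append(2.0 * x1 + rng.gauss(0, 0.1))

best_lam = select_lambda(X, y, [0.001, 0.01, 0.1, 1.0])
intercept, beta = lasso_fit(X, y, best_lam)
```

A sufficiently large penalty drives every coefficient to zero, while the cross-validated choice keeps the fit close to the data. In R, `cv.glmnet` performs the corresponding search over a much finer grid.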

In the questionnaire model, we first calculated the questionnaire continuous variables from the binary variables using LASSO (Supplementary Table S5) and then constructed the questionnaire model from these continuous variables using OLS, because the dimension of the explanatory variables is low and they are not highly intercorrelated.

In addition, because the ratio of SC types is not uniform, the previous model tended to classify subjects as the TE type, which has the highest ratio. As a result, the sensitivity of the TE type was high, but the sensitivity of the other SC types was significantly lower.

Sensitivity was low, particularly in the male SY type and female SE type. To compensate for this problem, weighting by SC types was added to the model by considering the ratios of SC types.
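The exact weighting formula is not given here, so the following is only a sketch of one common choice: weights inversely proportional to each SC type's share of the training data, scaled so that every type carries the same total weight. The function name `sc_type_weights` and the toy label counts are assumptions.

```python
from collections import Counter


def sc_type_weights(labels):
    """Inverse-proportion weights: n / (number_of_types * count_of_type)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {sc: n / (k * count) for sc, count in counts.items()}


# Toy training labels in which TE is the most frequent type.
labels = ["TE"] * 5 + ["SE"] * 3 + ["SY"] * 2
weights = sc_type_weights(labels)
# One weight per subject, suitable as observation weights for a fitting routine.
sample_weights = [weights[sc] for sc in labels]
total_by_type = {
    sc: sum(w for w, label in zip(sample_weights, labels) if label == sc)
    for sc in weights
}
```

Under this scheme the rarer SY and SE subjects each count for more than a TE subject, so the fitting criterion no longer favors the majority type.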

In the previous model, the training and test sets were separated based on the data collection year. However, the two sets were heterogeneous because of differences between collection years. In the present model, the complete set was randomly divided into a training set and a test set at a ratio of 7:3, taking SC type and age into account.
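A stratified random split of this kind might look as follows. The sketch assumes that age is binned by decade and that each (SC type, age group) stratum is split 7:3 independently; the binning, the record layout, and the function names are illustrative assumptions.

```python
import random
from collections import defaultdict


def age_group(age):
    """Decade bin, e.g. 34 -> 30 (assumed grouping)."""
    return (age // 10) * 10


def stratified_split(subjects, train_fraction=0.7, seed=0):
    """Split each (SC type, age group) stratum into train and test parts."""
    strata = defaultdict(list)
    for subject in subjects:
        strata[(subject["sc_type"], age_group(subject["age"]))].append(subject)
    rng = random.Random(seed)
    train, test = [], []
    for key in sorted(strata):
        members = strata[key]
        rng.shuffle(members)
        n_train = round(train_fraction * len(members))
        train.extend(members[:n_train])
        test.extend(members[n_train:])
    return train, test


# Hypothetical subjects: 10 per stratum.
subjects = (
    [{"sc_type": "TE", "age": 20 + i} for i in range(10)]
    + [{"sc_type": "SE", "age": 30 + i} for i in range(10)]
    + [{"sc_type": "SY", "age": 40 + i} for i in range(10)]
)
train_set, test_set = stratified_split(subjects)
```

Because every stratum is split at the same ratio, the training and test sets keep the SC-type and age composition of the complete dataset.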

The results of estimated coefficients for face, voice, body shape, and questionnaire continuous features for each SC type are shown in Supplementary Tables S2, S3, S4, and S6, respectively.

2.3.2. Integrating Diagnostic Models from Four Diagnostic Components

Let $\hat{p}_{ij}^{(k)}$ be the estimated probability of the $i$th subject in category $j$ for each individual diagnostic model, where $j$ = 1, 2, and 3 indicate TE, SE, and SY type, respectively.

In the previous study, the importance of each individual diagnostic model was also considered through multiplying the weights $w_k$ by $\hat{p}_{ij}^{(k)}$. The integrated estimated probability of an SC type for the $i$th subject, denoted as $\hat{P}_{ij}$, can be defined by the sum of $\hat{p}_{ij}^{(k)}$ with weight $w_k$:

$$\hat{P}_{ij} = \sum_{k=1}^{4} w_k \, \hat{p}_{ij}^{(k)},$$

where $k$ indicates each individual diagnostic component; $k$ = 1, 2, 3, and 4 represent face, body shape, questionnaire, and voice, respectively.

In this study, we constructed a model with explanatory variables as $\hat{p}_{ij}^{(k)}$ of individual diagnosis models. This scheme can be regarded as weighting by a methodical and rigorous method, unlike the case of the previous study, in which arbitrary weights were set.

In addition, compared to the previous weight assigned to each individual diagnostic component, the weight of this study, $w_{jk}$, was obtained considering each component and SC type as shown in Supplementary Table S7:

$$\hat{P}_{ij} = \sum_{k=1}^{4} w_{jk} \, \hat{p}_{ij}^{(k)}.$$

Finally, the predicted SC type for the $i$th subject, $\hat{Y}_i$, was determined in the same way as in the previous study:

$$\hat{Y}_i = \begin{cases} \text{TE} & \text{if } \max(\hat{P}_{i1}, \hat{P}_{i2}, \hat{P}_{i3}) = \hat{P}_{i1}, \\ \text{SE} & \text{if } \max(\hat{P}_{i1}, \hat{P}_{i2}, \hat{P}_{i3}) = \hat{P}_{i2}, \\ \text{SY} & \text{if } \max(\hat{P}_{i1}, \hat{P}_{i2}, \hat{P}_{i3}) = \hat{P}_{i3}, \end{cases}$$

where the numbers of the subscripts indicate the TE, SE, and SY types.
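The integration and decision steps can be sketched as a weighted sum of the four component probabilities followed by a maximum rule. The sketch assumes a purely linear combination with one weight per component and SC type. The weights and probabilities below are hypothetical numbers chosen for illustration and are not the estimates in Supplementary Table S7.

```python
SC_TYPES = ("TE", "SE", "SY")
COMPONENTS = ("face", "body_shape", "questionnaire", "voice")


def integrate(component_probs, weights):
    """Weighted sum over components, separately for each SC type.

    component_probs[component][sc_type] is a component model's probability;
    weights[sc_type][component] is the weight for that component and type.
    """
    return {
        sc: sum(weights[sc][c] * component_probs[c][sc] for c in COMPONENTS)
        for sc in SC_TYPES
    }


def predict_sc_type(integrated):
    """Choose the SC type with the largest integrated probability."""
    return max(SC_TYPES, key=lambda sc: integrated[sc])


# Hypothetical component outputs for one subject.
component_probs = {
    "face": {"TE": 0.5, "SE": 0.2, "SY": 0.3},
    "body_shape": {"TE": 0.6, "SE": 0.1, "SY": 0.3},
    "questionnaire": {"TE": 0.2, "SE": 0.5, "SY": 0.3},
    "voice": {"TE": 0.4, "SE": 0.3, "SY": 0.3},
}
# Hypothetical weights that differ by component and by SC type.
weights = {
    "TE": {"face": 0.2, "body_shape": 0.4, "questionnaire": 0.3, "voice": 0.1},
    "SE": {"face": 0.2, "body_shape": 0.2, "questionnaire": 0.5, "voice": 0.1},
    "SY": {"face": 0.3, "body_shape": 0.2, "questionnaire": 0.4, "voice": 0.1},
}
integrated = integrate(component_probs, weights)
predicted = predict_sc_type(integrated)
```

Allowing the weights to vary by SC type lets, for example, the questionnaire count more for one type and body shape more for another, which a single weight per component cannot express.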

3. Results

The predicted results of the proposed integrated diagnostic model are shown in Table 5. Relative to the predictions of the previous integrated diagnostic models, the accuracy in the test set is improved by approximately 10% on average. Moreover, the sensitivities of the male SY type (36.4% → 62.0%) and the female SE type (43.7% → 64.5%), which were low in the previous model, were significantly improved, while the sensitivity of the TE type and that of the female SY type were somewhat lowered.
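For reference, the sensitivity of an SC type is the share of subjects of that type that the model predicts correctly, and the overall accuracy is the share of all subjects predicted correctly. A minimal sketch with made-up labels (not data from Table 5):

```python
def sensitivity_by_type(actual, predicted):
    """Per-type sensitivity: correct predictions / subjects of that type."""
    result = {}
    for sc in sorted(set(actual)):
        idx = [i for i, a in enumerate(actual) if a == sc]
        result[sc] = sum(predicted[i] == sc for i in idx) / len(idx)
    return result


def accuracy(actual, predicted):
    """Share of all subjects whose predicted type matches the actual type."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Made-up example: the model over-predicts TE.
actual = ["TE", "TE", "TE", "TE", "SE", "SE", "SY", "SY"]
predicted = ["TE", "TE", "TE", "SE", "SE", "TE", "SY", "TE"]
sens = sensitivity_by_type(actual, predicted)
acc = accuracy(actual, predicted)
```

In this toy case the overall accuracy hides the fact that the two minority types are recognized only half of the time, which is why sensitivities are reported per SC type.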

Table 6 shows the results of applying a cutoff to the predicted results of the proposed integrated diagnostic model. To extract more typical SC-type predictions, the reference value for the cutoff criterion was changed from the maximum value among the three probability values of the SC types, which was proposed in the previous study, to the maximum value minus the second-highest value. This change was determined after testing various transformations of the cutoff value.
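The cutoff rule based on the gap between the two largest probabilities can be sketched as follows. The threshold of 0.1 is a hypothetical value used only for illustration, and returning `None` for subjects below the cutoff is an assumed way of marking an undetermined prediction.

```python
def margin(probs):
    """Largest probability minus the second-largest probability."""
    top, second = sorted(probs.values(), reverse=True)[:2]
    return top - second


def predict_with_cutoff(probs, cutoff):
    """Return the top SC type only when its margin reaches the cutoff."""
    if margin(probs) < cutoff:
        return None  # prediction withheld: no clearly dominant type
    return max(probs, key=probs.get)


clear_case = {"TE": 0.5, "SE": 0.3, "SY": 0.2}
unclear_case = {"TE": 0.36, "SE": 0.34, "SY": 0.30}
clear_prediction = predict_with_cutoff(clear_case, cutoff=0.1)
unclear_prediction = predict_with_cutoff(unclear_case, cutoff=0.1)
```

Unlike a cutoff on the maximum probability alone, the margin also reflects how close the runner-up is, so a subject whose two leading types are nearly tied is not counted as a typical case.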

Comparing the performances before and after the cutoff, the sensitivities and accuracies of each SC type were improved by approximately 7% on average. In particular, the sensitivity of the female SE type improved significantly.

This result alone does not justify a comparison with the previous model, because the two models were not evaluated on the same test set. In this study, the number of data cases is greater and the complete dataset is divided into the training set and test set at random. To compare the present model with the previous one more accurately, an additional comparison was performed using only those subjects in the present test set that were also included in the data used for the previous model (Table 7).

A comparison of the two results shows that the sensitivities of the previous model are greater for the male TE type and the female SY type, but for the other SC types the sensitivities and accuracies of the proposed model are higher by more than 10% on average.

Although the previous model has an advantage in this comparison because some of its training data are included in the test set of this study, it is notable that the overall performance of the proposed model is significantly improved.

4. Discussion and Conclusions

Relative to the previous studies, the diagnostic performance is significantly improved because of the reasonable considerations of various methodological approaches and the increased number of data cases and variables used to construct the model.

Although the previous diagnostic model offers the same accuracy as the overall model, the sensitivity for the TE type, which has the highest ratio among the SC types, is the highest, whereas the sensitivities for the SE and SY types are significantly lower. This result occurs because the estimation of the regression coefficients of the model is focused on reducing the error for the TE type. Therefore, to improve the performance for the SE and SY constitutional types, at the cost of somewhat degraded diagnostic performance for the TE type, the regression coefficients are estimated so that each SC type contributes equally, by training the model with a different weight for each SC type based on its proportion. As a result, the TE type, which was frequently predicted in the previous model, was confirmed to have inferior performance in the proposed model, but the performance associated with the other SC types was improved.

The regression coefficients are difficult to interpret individually because the model consists of many explanatory variables, and these variables are highly correlated because they are calculated from similar positions on the body.

In general, the width of the confidence interval of an estimated regression coefficient increases when multicollinearity occurs. A wide confidence interval implies that the estimated regression coefficient is not well determined and may differ substantially from the true value of the regression coefficient. Therefore, the influence of the variable cannot be explained correctly, and, as a result, the reliability of the model itself deteriorates.
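The widening of the confidence interval can be made explicit with the standard OLS result for a model with an intercept. The notation is introduced here for illustration and is not from the original text: $\sigma^2$ is the error variance, $s_j^2$ the sample variance of the $j$th explanatory variable, $n$ the number of subjects, $p$ the number of explanatory variables, and $R_j^2$ the coefficient of determination obtained by regressing the $j$th variable on all the others.

```latex
\operatorname{Var}(\hat{\beta}_j)
  = \frac{\sigma^2}{(n-1)\,s_j^2} \cdot \frac{1}{1 - R_j^2},
\qquad
\text{CI}_{1-\alpha}:\;
\hat{\beta}_j \pm t_{1-\alpha/2,\;n-p-1}\,
\sqrt{\widehat{\operatorname{Var}}(\hat{\beta}_j)}
```

The factor $1/(1 - R_j^2)$ is the variance inflation factor. As an example, if a variable is almost fully explained by the others with $R_j^2 = 0.99$, the factor is 100, so the confidence interval is 10 times wider than it would be for an uncorrelated variable with the same variance.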

Among the methods that can solve the multicollinearity problem, LASSO, as used in this study, can reduce multicollinearity without the modification of explanatory variables. Although the resolution of multicollinearity does not by itself lead to improved accuracy, the results suggest that the reliability of the regression coefficients is satisfactory and that stable results will be obtained when new data are analyzed using the model.

In the previous study, the training and test sets were classified based on the data collection year, but the proportions of SC types and age may be heterogeneous in the training and test sets. Therefore, considering the proportions of SC types and age, it is reasonable and well reflective of the characteristics of the entire dataset to obtain both the training set and the test set from the complete dataset.

In traditional medicine in Asia, including Chinese medicine, facial, body shape, voice, and pulse information, as well as ordinary symptoms, are combined in the process of diagnosis for prescription decisions. In particular, many studies on the relationship between Prakriti (Ayurveda constitution) and various objective biological parameters have been actively conducted in the field of Ayurveda, and, at the same time, efforts are being made to propose a standard protocol for the objective and reliable diagnosis of Prakriti [14]. We believe that the proposed method is the most advanced tool to support the ability of Asian doctors to diagnose in an objective and scientific way. The improvement of this algorithm is expected to contribute not only to the scientific development of Sasang constitutional medicine in Korea but also to that of other traditional medical diagnosis methods.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

This research was supported by the Bio & Medical Technology Development Program of the NRF funded by the Korean government, MSIP (NRF-2015M3A9B6027138).

Supplementary Materials

Table S1. List of Questions.

Table S2. Result of estimated coefficients for Facial features for each SC type using LASSO.

Table S3. Result of estimated coefficients for Voice features for each SC type using LASSO.

Table S4. Result of estimated coefficients for Body shape features for each SC type using LASSO.

Table S5. Result of estimated coefficients for Questionnaire features for each SC type using LASSO.

Table S6. Result of estimated coefficients for Questionnaire continuous features for each SC type using OLS.

Table S7. Result of estimated coefficients for estimated probability of individual diagnostic model for each SC type using LASSO.

Figure S1. The procedure of generating continuous variables with the response categories of the questions in the questionnaire.
