Physalis alkekengi L. var. franchetii (PALF) is a traditional Chinese medicine well known for its antimicrobial, anti-inflammatory, antipyretic, and expectorant properties. Its fruits and fruiting calyxes are used as dietary supplements and traditional herbs in China. However, the quality of the calyxes is uneven, and they are prone to becoming moldy or moth-eaten during storage. In this study, high-performance liquid chromatography (HPLC) fingerprints were combined with multivariate chemometric methods to evaluate quality, and three representative compounds were chosen as quality markers (Q-markers). Hierarchical cluster analysis (HCA) and principal component analysis (PCA) clearly discriminated among the PALF samples. After further verification by partial least squares discriminant analysis (PLS-DA), a backpropagation artificial neural network (BP-ANN), and machine learning, combined with content determination and an assessment of biological and pharmacological activity, luteoloside (galuteolin), rutin, and physalin O were identified as Q-markers whose contents reflect the quality grade of PALF. The result is a simple method for rapidly assessing the quality of PALF, which is important for its clinical application and storage and which provides a reference for evaluating the quality of materials used in Chinese medicine.

1. Introduction

Physalis alkekengi L. var. franchetii (PALF) consists of the dried persistent calyx, or the persistent calyx together with the fruit, of Physalis alkekengi L. var. franchetii (Mast.) Makino (Solanaceae). It is called “Jin-Deng-Long” (锦灯笼) in Chinese, and different parts of the plant are used for different medicinal purposes [1]. Its fruits and fruiting calyxes are used as dietary supplements and traditional herbs in China and are commonly prescribed in folk medicine to treat sore throat, mumps, and swollen and painful gums. The calyxes in particular are components of some important medicines. Tea made from them can relieve cough, reduce phlegm, and detoxify the body [2]. PALF contains physalins and flavonoids as well as terpenoids, alkaloids, aliphatic compounds, sterols, phenylpropanoids, and phenolic acids. Physalins such as physalin A significantly inhibit the proliferation of a variety of tumors [3–6]. PALF also has antimicrobial, anti-inflammatory, antioxidant, and hypoglycemic effects [7–11].

Quality control is an important part of research on traditional Chinese medicine (TCM) because it ensures the safety, effectiveness, and stability of products. The safety of TCM also deserves attention; for example, methods have been developed to remove heavy metals [12]. As a new concept, the quality marker (Q-marker) is important for standardizing the system of quality control of TCM [13, 14]. In recent years, research on PALF has focused on the separation of its chemical components, its pharmacological effects, and the determination of the content of its bioactive molecules. However, PALF can be eaten by moths during cultivation, and the medicinal material is prone to becoming moldy during storage. Both reduce the content of the active ingredients of PALF, which in turn degrades its efficacy.

Chemical fingerprinting is a useful method for comprehensively controlling the quality of herbal products. Because it is systematic and stable, it is consistent with the holistic view of Chinese medicine and has a wide range of applications in herbal quality control [15]. Chemometrics, applied to metabolic profiles, has exhibited a powerful capability for multivariate analysis [16]. Artificial neural networks (ANNs) are mathematical tools that simulate biological neural networks and can therefore be used for distributed parallel information processing [17]. The backpropagation artificial neural network (BP-ANN) is a one-way multilayer feedforward network that represents a nonlinear mapping from the input to the output. The network consists of three parts: an input layer, a hidden layer, and an output layer [18]. Principal component analysis (PCA) is an unsupervised method of chemometric analysis that reduces the number of dimensions of the data without losing too much information. It can be used to identify differences between samples and to visualize clusters and outliers [19, 20]. Hierarchical cluster analysis (HCA) is a method of clustering in which clusters are formed either by combining smaller groups into larger ones or by subdividing larger clusters [21]. Partial least squares discriminant analysis (PLS-DA), a supervised classification technique, is a form of PLS that uses a categorical response variable Y to improve the separation between categories [22]. With the advent of big data, machine learning has become an important tool in research and applications, and a number of machine learning algorithms have been used in classification problems.

In this study, a strategy combining chromatographic fingerprints with chemical pattern recognition was developed to standardize the quality evaluation of PALF. Multicomponent chromatographic fingerprints were established, and five representative components (rutin, luteoloside, luteolin, physalin A, and physalin O) were identified by comparing their relative retention times and characteristic ultraviolet (UV) absorption with those of the reference substances. Multiple pattern recognition models were established by HCA, PCA, PLS-DA, and BP-ANN. Chemical markers were obtained, and machine learning was used for data mining and analysis to identify the Q-markers that affect the quality of PALF. A criterion that correlates the appearance of PALF with its assessed quality was established. The proposed strategy provides an integrated way to evaluate the quality of PALF and a reference for evaluating the quality of Chinese herbal medicines.

2. Materials and Methods

2.1. Reagents and Plants

HPLC-grade methanol and acetonitrile were purchased from Sigma-Aldrich (St. Louis, MO, USA), HPLC-grade formic acid was obtained from Tianjin Kemiou Chemical Reagent Co., Ltd. (Tianjin, China), and distilled water for HPLC analysis was purchased from Watson Group Ltd. (Hong Kong, China). As reference compounds, chlorogenic acid, quercetin, rutin, luteolin, and luteoloside (purity > 98%) were purchased from Shanghai Yuanye Biotechnology Co., Ltd. (Shanghai, China). Physalin A and physalin O (purity > 98%) were all purchased from Nanjing Spring and Autumn Co., Ltd. (Nanjing, China).

The calyxes of the PALF samples were collected from five sites in Hebei, Jilin, Heilongjiang, Anhui, and Shanxi. All samples were authenticated by Dr. Lin Ma according to the latest version of the Chinese Pharmacopoeia; their details are shown in Table 1.

2.2. HPLC Analysis
2.2.1. Sample Preparation

(1) Preparation of Sample Solution. All samples were pulverized and passed through a 50-mesh sieve. Subsequently, 200 mg of the powdered sample was dissolved in 5 mL of methanol and sonicated for 30 min (150 W, 40 kHz). After shaking, the sample was centrifuged at 8000 r/min for 5 min. The supernatant was passed through a 0.22 μm microporous membrane. Each sample was processed three times in parallel.

(2) Preparation of Standard Solutions. All reference compounds were weighed using a 1/100,000 analytical balance (BT125D; Sartorius, Germany) and dissolved in methanol (5 mL).

2.2.2. HPLC Conditions

The analyses were performed on an Agilent 1260 liquid chromatography system (Agilent Technologies, Palo Alto, CA, USA) equipped with a binary pump and a diode array detector. Chromatographic separations were performed on a Symmetry® C18 column (4.6 × 150 mm, 5 μm) maintained at 40°C. The mobile phase consisted of water containing 0.3% phosphoric acid (solvent A) and acetonitrile (solvent B). The gradient was 5%–10% B for 0–3 min, 10%–12% B for 3–10 min, 12%–15% B for 10–13 min, 15%–30% B for 13–20 min, 30%–40% B for 20–25 min, 40%–45% B for 25–30 min, 45%–80% B for 30–40 min, and 80%–100% B for 40–45 min. The flow rate was maintained at 1 mL/min, the injection volume was 10 μL, and the wavelength for UV detection was set at 254 nm. The absorption spectra of the investigated compounds were recorded between 190 and 400 nm. The compounds were identified by comparing their retention times and UV spectra with those of the reference substances.
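The gradient program above can be written down as a list of breakpoints, which makes it easy to check the solvent composition at any time point. The following is a minimal sketch that assumes linear ramps between the stated breakpoints; the names GRADIENT and percent_b are illustrative and are not part of the instrument software.

```python
from bisect import bisect_right

# (time in min, % solvent B) breakpoints transcribed from the gradient in the text.
GRADIENT = [(0, 5), (3, 10), (10, 12), (13, 15), (20, 30),
            (25, 40), (30, 45), (40, 80), (45, 100)]


def percent_b(t_min):
    """Return %B at time t_min, assuming linear ramps between breakpoints."""
    times = [t for t, _ in GRADIENT]
    if not times[0] <= t_min <= times[-1]:
        raise ValueError("time outside the 0-45 min gradient window")
    i = bisect_right(times, t_min) - 1
    if i == len(GRADIENT) - 1:  # exactly at the final breakpoint
        return float(GRADIENT[-1][1])
    (t0, b0), (t1, b1) = GRADIENT[i], GRADIENT[i + 1]
    return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)


if __name__ == "__main__":
    for t in (0, 16.5, 35, 45):
        print(f"{t:>5} min: {percent_b(t):.1f}% B")
```

Solvent A at any time is simply 100 minus the returned value.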

2.2.3. HPLC Methodological Evaluation

Precision was validated by six consecutive injections of the same sample solution. The relative retention time and relative peak area of each common peak were calculated by using luteoloside as the reference peak. The relative standard deviation (RSD) for the relative peak retention time was <2% and that for the relative peak area was <5%, indicating high precision.

Six sample solutions were prepared in parallel from the same batch of PALF. The relative retention time and relative peak area of each common peak were calculated by using luteoloside as the reference peak. The RSD for the relative peak retention time was <2% and that for the relative peak area was <5%, indicating good repeatability.

Samples of the same PALF solution were analyzed at 0, 2, 4, 8, 12, and 24 h. The relative retention time and relative peak area of each common peak were calculated by using luteoloside as the reference peak. The RSD for the relative peak retention time was <2% and that for the peak area was <5%, indicating good sample stability for 24 h.
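The precision, repeatability, and stability checks all rest on the same calculation: express each common peak relative to the luteoloside reference peak, then compute the RSD across injections. A minimal sketch follows; the retention times are made-up illustrative numbers, not data from this study, and the function names are hypothetical.

```python
from statistics import mean, stdev


def relative_to_reference(peak_values, reference_values):
    """Divide each peak value by the reference-peak value from the same injection."""
    return [p / r for p, r in zip(peak_values, reference_values)]


def rsd_percent(values):
    """Relative standard deviation (%), using the sample standard deviation (n - 1)."""
    m = mean(values)
    if m == 0:
        raise ValueError("RSD is undefined for a zero mean")
    return 100.0 * stdev(values) / abs(m)


# Hypothetical retention times (min) of one common peak and of the
# luteoloside reference peak over six injections (illustrative only).
peak_rt = [12.41, 12.43, 12.40, 12.42, 12.44, 12.41]
ref_rt = [17.80, 17.83, 17.79, 17.81, 17.84, 17.80]

rrt = relative_to_reference(peak_rt, ref_rt)
print(f"RSD of relative retention time: {rsd_percent(rrt):.3f}%")
```

The same two functions apply unchanged to relative peak areas, with the acceptance limit raised from 2% to 5% as described above.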

Luteoloside, rutin, and physalin O were weighed, placed into 10 mL volumetric flasks, and made up to volume with methanol. Stock solutions with concentrations of 0.4 mg/mL, 0.5 mg/mL, and 0.5 mg/mL, respectively, were obtained and diluted stepwise into a series of reference solutions of different concentrations. Each reference solution was injected for analysis, and the peak area was determined. Linear regression analysis was carried out with the reference concentration (C) as the abscissa and the peak area (A) as the ordinate, and the linear regression equation, correlation coefficient, and linear range of each component were obtained. The results are shown in Table 2.
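The calibration step can be sketched as an ordinary least-squares fit of peak area against concentration. The dilution series and peak areas below are hypothetical and are not the values in Table 2; fit_calibration and concentration_from_area are illustrative names.

```python
def fit_calibration(conc, area):
    """Least-squares fit of A = slope * C + intercept.

    Returns (slope, intercept, r), where r is the correlation coefficient.
    """
    n = len(conc)
    if n != len(area) or n < 3:
        raise ValueError("need at least three paired calibration points")
    mc, ma = sum(conc) / n, sum(area) / n
    sxx = sum((c - mc) ** 2 for c in conc)
    syy = sum((a - ma) ** 2 for a in area)
    sxy = sum((c - mc) * (a - ma) for c, a in zip(conc, area))
    slope = sxy / sxx
    intercept = ma - slope * mc
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


def concentration_from_area(area, slope, intercept):
    """Invert the calibration line to estimate concentration from a peak area."""
    return (area - intercept) / slope


# Hypothetical dilution series (mg/mL) and peak areas, for illustration only.
conc = [0.025, 0.05, 0.1, 0.2, 0.4]
area = [151.0, 298.0, 603.0, 1199.0, 2405.0]

slope, intercept, r = fit_calibration(conc, area)
print(f"A = {slope:.1f} * C + {intercept:.2f}, r = {r:.5f}")
```

Estimates from concentration_from_area are only meaningful inside the linear range covered by the calibration points.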

Six portions of S8 powder with known contents of luteoloside, rutin, and physalin O were weighed, and an appropriate amount of each reference substance was added. The average recoveries of luteoloside, rutin, and physalin O were 99.24%, 100.01%, and 99.99%, respectively, with RSD values of 2.76%, 1.43%, and 1.36%, respectively, which met the requirements.
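Spike recovery is conventionally computed as the amount found minus the amount originally present, divided by the amount added. The sketch below assumes that definition; the six amounts are invented for illustration and do not reproduce the reported recoveries.

```python
from statistics import mean, stdev


def recovery_percent(found, original, spiked):
    """Recovery (%) = (amount found - amount originally present) / amount added * 100."""
    if spiked <= 0:
        raise ValueError("spiked amount must be positive")
    return 100.0 * (found - original) / spiked


# Hypothetical amounts (mg) for six spiked portions: (found, original, spiked).
portions = [
    (0.398, 0.200, 0.200), (0.401, 0.201, 0.200), (0.396, 0.199, 0.200),
    (0.403, 0.200, 0.200), (0.399, 0.202, 0.200), (0.402, 0.200, 0.200),
]

recoveries = [recovery_percent(*p) for p in portions]
avg = mean(recoveries)
rsd = 100.0 * stdev(recoveries) / avg
print(f"mean recovery = {avg:.2f}%, RSD = {rsd:.2f}%")
```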

2.3. Software Requirements

Chromatographic data were obtained from a liquid chromatography workstation data management software program (Agilent 1260). HCA, PCA, and PLS-DA were used as the methods of multivariate chemometric classification with SIMCA-P14.0 (Sartorius Scientific Instrument Co., Ltd., Germany). The BP-ANN was applied using SPSS version 23.0 (IBM Corp., Armonk, NY, USA), and the machine learning algorithm was programmed with Python 3.7.6.

2.4. Determination of Chemical Markers

The sample solution was prepared as described in Sample Preparation, and the sample was injected and analyzed under chromatographic conditions identified above to determine the contents of the chemical markers.

3. Results

3.1. Similarity Evaluation

Representative HPLC-DAD chromatograms of the standards and samples are shown in Figure S1. The superimposed fingerprints, obtained using multipoint correction, automatic matching, and the median method, are shown in Figure 1. Thirty-one common peaks were identified across the samples by multipoint correction and automatic matching, and similarity was characterized by the correlation coefficient. As shown in Table S1, the similarity values of the samples ranged from 0.669 to 0.995, indicating that the batches were broadly consistent, although the values at the lower end of the range point to appreciable differences among some batches.
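The similarity calculation can be illustrated as follows: build a reference fingerprint from the column-wise median of the common-peak areas, then compute the correlation coefficient between each sample and that reference. This is a sketch of the general idea only; the fingerprint software used in the study may align, weight, or normalize peaks differently, and the toy matrix is not real data.

```python
from statistics import median


def median_fingerprint(samples):
    """Column-wise median of the peak-area vectors (one vector per sample)."""
    return [median(col) for col in zip(*samples)]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def similarity_to_reference(samples):
    """Similarity of every sample to the median reference fingerprint."""
    reference = median_fingerprint(samples)
    return [pearson(s, reference) for s in samples]


# Toy peak-area matrix: 4 samples x 5 common peaks (illustrative only).
areas = [
    [120.0, 35.0, 410.0, 88.0, 15.0],
    [118.0, 33.0, 405.0, 90.0, 14.0],
    [125.0, 36.0, 398.0, 85.0, 16.0],
    [60.0, 70.0, 150.0, 140.0, 40.0],  # a visibly different batch
]
for i, s in enumerate(similarity_to_reference(areas), start=1):
    print(f"S{i}: similarity = {s:.3f}")
```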

3.2. PCA Modeling

PCA, one of the most commonly used unsupervised methods, is an exploratory tool for finding patterns in data. The PCA approach helps ensure equal contributions of the variables to the results. Following the principle of maximum variance, the original variables were linearly combined to generate new low-dimensional variables that replace the original high-dimensional ones. The HPLC workstation data management software was used to obtain the peak area, retention time, and other related information for PALF. The resulting data matrix (28 samples × 31 variables) was imported into SIMCA-P14.0. The values of R2X and Q2 of the PCA model were 0.784 and 0.908, respectively. The PCA score plot (Figure 2(a)) shows that the samples were divided into two categories. The G1 samples had orange or orange-red persistent calyxes with almost no insect damage or mold spots. The G2 samples were brown or orange-brown with varying degrees of mold or insect infestation. To further explore the influence of mold and insects, the G2 samples were subjected to a separate PCA, and the results are shown in Figure 2(b). The G2 samples were divided into two types: one with mild mildew and moth damage, and the other with more severe mildew and moth damage.
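The maximum-variance idea behind PCA can be shown with a small standard-library sketch that extracts the first principal component by power iteration. It assumes unit-variance (auto) scaling of each peak, which is a common choice but is not stated in the text, and it is not the SIMCA implementation. The explained-variance fraction it returns plays the role of R2X for that single component.

```python
import math


def autoscale(X):
    """Mean-center each column and scale it to unit sample variance.

    Assumes no column is constant (a zero standard deviation would divide by zero).
    """
    n = len(X)
    cols = list(zip(*X))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in X]


def first_pc(Z, iters=500):
    """Return (loadings, scores, explained-variance fraction) for PC1."""
    n, p = len(Z), len(Z[0])
    # Covariance matrix (p x p) of the scaled data.
    C = [[sum(Z[k][i] * Z[k][j] for k in range(n)) / (n - 1)
          for j in range(p)] for i in range(p)]
    # Deterministic, non-uniform start vector for the power iteration.
    v = [1.0 / (i + 1) for i in range(p)]
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(C[i][j] * v[j] for j in range(p))
                     for i in range(p))
    scores = [sum(z * l for z, l in zip(row, v)) for row in Z]
    total_variance = sum(C[i][i] for i in range(p))
    return v, scores, eigenvalue / total_variance


# Toy matrix: 6 samples x 3 peaks forming two visible groups (illustrative only).
X = [[10.0, 5.0, 1.0], [11.0, 5.5, 1.2], [10.5, 5.2, 0.9],
     [3.0, 1.0, 4.0], [2.5, 1.2, 4.4], [3.2, 0.8, 4.1]]
loadings, scores, frac = first_pc(autoscale(X))
print("PC1 scores:", [round(s, 2) for s in scores])
print(f"variance explained by PC1: {frac:.3f}")
```

Plotting the scores of the first two components against each other gives a score plot of the kind shown in Figure 2.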

3.3. HCA Modeling

The PCA score plot shows that the samples were classified according to the extent to which they had been eaten by moths and covered with mildew. HCA, another multivariate technique, was used to verify this grouping. The HCA of the samples was performed using SIMCA-P14.0. Ward’s method was used as the amalgamation rule, and the squared Euclidean distance was used as the metric to establish the clusters of samples. Figure 3(a) shows that the samples could be grouped into two categories. One consisted of S7, S8, S9, S10, S20, S21, S22, S23, S24, S26, S27, and S28, and the other consisted of S1, S2, S3, S4, S5, S6, S11, S12, S13, S14, S15, S16, S17, S18, S19, and S25. The results of an HCA of the second category (the inferior samples) are shown as a dendrogram in Figure 3(b), which indicates that these samples could be divided further into two types.
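Agglomerative clustering with Ward's rule on squared Euclidean distances can be sketched in a few lines using the Lance-Williams update formula. This is an illustration of the technique under those two stated settings, not the SIMCA routine, and the four toy points are invented.

```python
def ward_clusters(X, n_clusters):
    """Agglomerative clustering with Ward's rule on squared Euclidean distances.

    Returns a sorted list of clusters, each a sorted list of row indices of X.
    """
    if not 1 <= n_clusters <= len(X):
        raise ValueError("n_clusters must be between 1 and the number of samples")
    clusters = {i: [i] for i in range(len(X))}
    # Pairwise squared Euclidean distances, keyed as (smaller id, larger id).
    dist = {}
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            dist[(i, j)] = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
    next_id = len(X)
    while len(clusters) > n_clusters:
        i, j = min(dist, key=dist.get)  # closest pair under Ward's criterion
        d_ij = dist.pop((i, j))
        ni, nj = len(clusters[i]), len(clusters[j])
        updated = {}
        for k in clusters:
            if k in (i, j):
                continue
            nk = len(clusters[k])
            d_ki = dist.pop((min(k, i), max(k, i)))
            d_kj = dist.pop((min(k, j), max(k, j)))
            # Lance-Williams update for Ward's method.
            updated[(k, next_id)] = (
                (ni + nk) * d_ki + (nj + nk) * d_kj - nk * d_ij
            ) / (ni + nj + nk)
        clusters[next_id] = sorted(clusters.pop(i) + clusters.pop(j))
        dist.update(updated)
        next_id += 1
    return sorted(clusters.values())


# Toy data: two well-separated pairs of points (illustrative only).
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
print(ward_clusters(points, 2))
```

Cutting the tree at two clusters corresponds to the two categories read off the dendrogram; rerunning on the members of one cluster mirrors the second-stage analysis of the inferior samples.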

3.4. PLS-DA Modeling

Partial least squares (PLS) is a way to identify the basic relationship between an independent variable X and a dependent variable Y [23]. To identify chemical markers, a supervised PLS-DA model was established for the 28 batches of samples. The 3D score plot of PLS-DA (Figure 4(a)) showed good clustering, indicating that the samples had been divided into two categories. In general, a value of Q2 greater than 0.5 indicates good predictability, while a low value indicates noisy data. In our model, R2Y = 0.914 and Q2 = 0.878, indicating that the accuracy of the model was relatively high. This shows that the PLS-DA model fitted the data well and could predict new data. PLS-DA was also performed on the 16 batches of low-quality samples, as shown in Figure 4(b). The cross-validated score was Q2 (cum) = 0.887. Based on these values, six components were extracted as the principal components of the model. The cumulative contribution rate was R2Y (cum) = 0.945.

To weigh the importance of each variable to the recognition capability of the model, a VIP (variable importance in projection) plot was drawn (Figure S2), which summarizes the importance of each variable in explaining X and its correlation with Y. The figure shows that the VIP values of variables 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 16, 19, 22, 23, 28, 30, and 31 were greater than one, which means that these variables played an important role in discriminating between the groups. Among them, variable 7 is rutin, variable 9 is luteoloside, variable 14 is physalin A, and variable 16 is physalin O.

3.5. BP-ANN Modeling

The BP-ANN model was used to further improve the accuracy of recognition. A three-layer BP-ANN model was established and used to classify 84 samples (consistent with the 28 batches each being prepared in triplicate). The input layer had 31 neurons, the hidden layer had three, and the output layer had two.

The training and test sets were used to evaluate the performance of the model in terms of recognition and prediction. To predict the classification of the samples, two-thirds of them were randomly selected as the training set and the remaining as the test set. The reproducibility and accuracy of the model were verified by separating the test data from the calibration process.
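The random two-thirds split can be sketched with the standard library as follows. This is an illustration rather than the SPSS partitioning procedure; the seed and the function name are arbitrary choices made here for reproducibility.

```python
import random


def split_two_thirds(sample_ids, seed=0):
    """Randomly assign two-thirds of the samples to training, the rest to testing."""
    ids = list(sample_ids)
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    rng.shuffle(ids)
    n_train = round(len(ids) * 2 / 3)
    return sorted(ids[:n_train]), sorted(ids[n_train:])


# 84 sample solutions, numbered 1..84 for illustration.
train_ids, test_ids = split_two_thirds(range(1, 85))
print(len(train_ids), "training samples;", len(test_ids), "test samples")
```

One design point worth checking in practice: if replicate preparations of the same batch can fall on both sides of the split, test accuracy may look better than it would on genuinely new batches, so splitting by batch is a more conservative alternative.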

Table S2 shows that all samples were successfully divided into two groups, the predicted and the test values fitted well, and the accuracy was 100%, indicating that the proposed artificial neural network had performed very well. Compared with the PLS-DA method, the BP-ANN model was more accurate. As shown in Figure 5, the ten most important variables in the BP-ANN model were 26, 9, 30, 23, 16, 11, 4, 6, 12, and 8. Among them, variable 9 represents luteoloside, variable 16 is physalin O, and variable 12 is luteolin.

3.6. Machine Learning Algorithm

This experiment was carried out in Visual Studio Code, with Python 3.7.6 used for programming. The machine learning model was XGBoost (extreme gradient boosting), an improvement over the gradient-boosted decision tree (GBDT) algorithm, with gradient-boosted trees as the base learner. The learning objective was set to “binary:logistic” to solve the binary classification problem. The maximum tree depth was set to five and the number of boosting iterations to 50; these values can be adjusted according to the amount of data.
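The settings described above map onto the XGBoost Python package roughly as shown below. This is a sketch, not the authors' script: only the objective, tree depth, booster type, and number of rounds come from the text, and every other parameter is left at the library default. The helper train_classifier is a hypothetical name and needs the xgboost package to be installed.

```python
# Settings stated in the text; all other XGBoost parameters stay at their defaults.
PARAMS = {
    "booster": "gbtree",             # tree-based base learner
    "objective": "binary:logistic",  # binary classification with probability output
    "max_depth": 5,                  # maximum depth of each tree
}
NUM_BOOST_ROUND = 50                 # number of boosting iterations


def train_classifier(features, labels):
    """Train a booster on a feature matrix and 0/1 labels (requires xgboost)."""
    # Imported lazily so the settings above can be inspected without the package.
    import xgboost as xgb

    dtrain = xgb.DMatrix(features, label=labels)
    return xgb.train(PARAMS, dtrain, num_boost_round=NUM_BOOST_ROUND)
```

With a trained booster, predict on a DMatrix returns class probabilities under this objective, and get_score returns per-feature importance values of the kind used to rank the characteristic peaks.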

The performance of the final model is shown in Figure S3, and the results of its prediction on the test set are shown in Figure S4. Using peaks 7, 9, and 16 as features, the classification accuracy was 96.15%. The importance values of the three characteristic peaks (luteoloside, variable no. 9; rutin, variable no. 7; and physalin O, variable no. 16) were obtained from the importance function built into the XGBoost model, as shown in Figure S4. The results show that the samples were accurately classified and recognized on the basis of these three characteristic components, whose order of importance was variable 16 > 7 > 9.

3.7. Determination of Chemical Markers

The combined results of PLS-DA, BP-ANN, and machine learning show that luteoloside, rutin, and physalin O affected the quality of the samples. The water content of the PALF material was determined before the test using established methods [24–26] to make the results more accurate. The contents of the three compounds were then determined in order to choose the most suitable components for use as quality markers.

The results are shown in Table S3. The content of luteoloside in all batches of samples met the requirement of the Chinese Pharmacopoeia (>0.1%), indicating that the quality of PALF could not be evaluated by this single component. There were significant differences in the contents of each component among the samples, which was consistent with the results of the fingerprint analysis. The independent-samples t-test was used to compare the contents of the three components between the sample types. The components showed significant differences among the three groups, indicating that they were suitable for use as Q-markers to identify the appearance traits of the samples. To show the differences in the contents of the active ingredients in PALF more intuitively, a bar graph is given in Figure 6.
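The group comparison can be illustrated with a two-sample t statistic. The sketch below uses Welch's unequal-variance form, which is an assumption made here because the text does not say whether equal variances were assumed. It returns the statistic and degrees of freedom only; the p-value would come from a t distribution table or a statistics package. The content values are invented.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's two-sample t statistic and its approximate degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical marker contents (%) in two sample types (illustrative only).
good_quality = [0.31, 0.29, 0.33, 0.30, 0.32]
poor_quality = [0.18, 0.21, 0.17, 0.20, 0.19]

t_stat, dof = welch_t(good_quality, poor_quality)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```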

3.8. Biological and Pharmaceutical Activities of Q-Markers

Information on the Q-markers in the PALF samples, including their chemical structures and biological and pharmaceutical properties, is listed in Table 3 [27–33]. As the contents of the three components in the first type of samples were significantly higher than those in the second and third types of samples, they could be used as markers to assess the quality of the samples.

4. Discussion and Conclusions

Many factors affecting the quality of medicinal materials are directly reflected in their appearance, and appearance is often used as the standard for evaluating quality in traditional Chinese medicine. It is therefore important to establish an evaluation standard that correlates appearance with quality and to obtain quality markers for the assessment of traditional medicines. This study proposed a method to assess the quality of PALF from two perspectives: the content of its active ingredients and its appearance. The quality of PALF may be affected by picking and storage conditions, under which the calyxes may become mildewed or worm-eaten. In contrast to traditional methods of identification, such as microscopic identification, TLC identification, and content determination, the proposed method extracts chemical markers from a comprehensive analysis of the chemical constituents of PALF. Analysis using machine learning methods and chemical markers showed that luteoloside, rutin, and physalin O can be used as Q-markers because their contents influence the quality of PALF. Furthermore, emerging methods for the pretreatment and extraction of active components [34–36] may bring new insights into the material basis of PALF. The proposed method thus provides a scientific basis and technical means to assess the quality of PALF and a reference for evaluating the quality of materials used in TCM.


Abbreviations

PALF: Physalis alkekengi L. var. franchetii
HPLC: High-performance liquid chromatography
PCA: Principal component analysis
HCA: Hierarchical cluster analysis
PLS-DA: Partial least squares discriminant analysis
BP-ANN: Backpropagation artificial neural network
VIP: Variable importance in projection

Data Availability

The data used to support the findings of this study are included within this article.

Conflicts of Interest

The authors declare no conflicts of interest.

Authors’ Contributions

Meiqi Liu and Ziying Qiu were involved in formal analysis. Lili Sun and Lizhi Wang performed the methodology. Xiaoliang Ren was responsible for project administration. Yanru Deng obtained the resources. Xiaoran Zhao ran the software. Meiqi Liu reviewed and edited the manuscript.


Acknowledgments

This work was financially supported by the National Key R&D Program of China (2019YFC1711000).

Supplementary Materials

Figure S1: HPLC chromatograms of mixed standard compounds (a) and PALF sample S8 (b). Figure S2: VIP plot of PLS-DA. Figure S3: performance of the machine learning model and results of prediction on the test set. Figure S4: importance of the three characteristic peaks. Table S1: similarity evaluation of 28 batches of PALF. Table S2: results of training the BP-ANN model. Table S3: contents of the three chemical markers of PALF.