Abstract

This paper proposes a separation method for High Performance Liquid Chromatography with Diode Array Detection (HPLC-DAD) data sets, based on the Generalized Reference Curve Measurement model and the Particle Swarm Optimization algorithm (GRCM-PSO). First, initial parameters are generated to construct reference curves for the chromatogram peaks of the compounds, based on their physical principle. Second, a Generalized Reference Curve Measurement (GRCM) model is designed to transform these parameters into scalar values that indicate the fitness of each parameter. Third, rough solutions are found by searching for an individual target for every parameter, and reinitialization is executed only around these rough solutions. Fourth, the Particle Swarm Optimization (PSO) algorithm is adopted to obtain the optimal parameters by minimizing the fitness of these new parameters given by the GRCM model. Finally, spectra for the compounds are estimated from the optimal parameters and the HPLC-DAD data set. Simulations and experiments lead to the following conclusions: (1) the GRCM-PSO method can separate the chromatogram peaks and spectra from the HPLC-DAD data set without knowing the number of compounds in advance, even when severe overlap and white noise exist; (2) the GRCM-PSO method is able to handle a real HPLC-DAD data set.

1. Introduction

After more than 100 years of development, chromatography has become the collective term for a set of laboratory techniques for quality control of various mixtures such as herbal medicine, grape wine, agricultural products, and petroleum. With the development of chromatographic instruments, the High Performance Liquid Chromatography with Diode Array Detector (HPLC-DAD) technology is used in many studies to generate a data set containing the chromatogram peaks and spectra for all compounds. Figure 1 shows the principle of the HPLC-DAD data set. The sample is injected at the sample injection point. The high pressure pump drives the solvent to carry the sample through the column with absorbent. Different compounds experience different resistance when they go through the column. Given an ultraviolet detector at the bottom of the column, a chromatogram peak represented by will be observed when one compound comes out from the column. The position and area of the peak identify the compound and its amount. If the detector is a DAD, which has more than one thousand channels to detect multiple wavelengths simultaneously, the spectrum for the same compound represented by will also be recorded. represents th compound and represents the mixture. The relationship of the variables in Figure 1 can be shown as where indicates the number of the compounds.
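The relationship described above is the usual bilinear mixture: the data set is a sum, over compounds, of the outer product of one spectrum and one chromatogram peak. The following is a minimal sketch under stated assumptions. The names `peaks`, `spectra`, and `X`, the Gaussian peak shape, and the orientation of `X` (rows as wavelength channels, columns as time points) are illustrative choices, not the notation of (1).

```python
import numpy as np

def gaussian_peak(n_times, position, width):
    # Hypothetical helper: a unit-height Gaussian sampled at n_times points.
    t = np.arange(n_times)
    return np.exp(-0.5 * ((t - position) / width) ** 2)

def mix(peaks, spectra):
    # peaks:   (n_compounds, n_times)    one chromatogram peak per row
    # spectra: (n_compounds, n_channels) one spectrum per row
    # Returns X with shape (n_channels, n_times): the sum of the outer
    # products spectrum_i * peak_i^T over all compounds.
    return spectra.T @ peaks

rng = np.random.default_rng(0)
peaks = np.vstack([gaussian_peak(100, 40, 5.0), gaussian_peak(100, 50, 6.0)])
spectra = rng.random((2, 30))
X = mix(peaks, spectra)
print(X.shape)  # (30, 100)
```

The two peaks in this toy example overlap, so no single column of `X` belongs to one compound alone, which is the situation the separation method is meant to handle.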

For the data set in (1), several separation methods already exist, but each has shortcomings. The algorithm of evolving factor analysis (EFA) [1, 2] and its improvements such as evolutionary factor analysis (EVOLU) [3], fixed-size moving window evolving factor analysis (FSMWEFA) [4], heuristic evolving latent projections (HELP) [5], and orthogonal projection resolution (OPR) [6] are used to assess peak purity, but they do not provide full quantitative information. The method of Multivariate Curve Resolution with Alternating Least Square (MCR-ALS) [7, 8] can recover the pure species spectra and elution profiles. However, the MCR-ALS method fails when the mixture becomes complex (see the simulations). Moreover, the performance of the MCR-ALS method depends on two important parameters: (1) a threshold for deciding the number of the compounds; (2) the noise level of the data set for estimating initial spectra. Usually, it is not easy to decide these two parameters when noise exists (see Appendix A for explanation). The immune algorithm (IA) [9, 10] can extract the compounds from noise. However, standard chromatogram peaks for the compounds must be obtained from experiments in advance. The method of independent component analysis (ICA) [11] can separate the HPLC-DAD data set without knowing the number of the compounds in advance. However, clustering methods are still needed to select compounds from the obtained independent components. Our previous works proposed a model named independent components analysis constrained by reference curve (ICARC) and its solution by multiarea genetic algorithm (mGA) [12] and by multitarget Particle Swarm Optimization (mPSO) [13], respectively, which can extract the chromatogram peaks from the HPLC-DAD data set directly. However, through further analysis, we find that it is not necessary for the chromatogram peaks (source signals) to be independent from each other. Therefore, a method based on the model of Generalized Reference Curve Measurement (GRCM) and the algorithm of Particle Swarm Optimization (PSO) is proposed in this paper.

The remainder of this paper is arranged as follows: Section 2 introduces the principle of the GRCM-PSO method; Section 3 gives the simulations and experiments; finally, Section 4 presents the conclusions and future work.

2. Mathematical Methods

It is difficult to extract the and in (1) from the data set alone, without any other knowledge. Fortunately, the fact that a chromatogram peak is shaped like a Gaussian curve [14] can help. Based on this “a priori” knowledge, the GRCM-PSO method is proposed as shown in Figure 2. First, a reference curve with parameter is constructed based on the general shape of the chromatogram peak, according to which the initial population , is generated. Then, the GRCM model calculates the errors , for the parameters. Next, a search category is used to obtain the rough solutions , (). In the dashed rectangular box in Figure 2, a step called reinitialization generates parameters randomly around one rough solution, for example, , in Figure 2 for the first rough solution. The GRCM model calculates the errors for these . Based on these errors , the PSO algorithm is adopted to obtain the optimal parameter around . Similarly, other optimal parameters , can be found. Finally, the approximated chromatogram peaks can be constructed by the reference curve, and the spectra can be obtained by an estimator.

The structure of the parameter used in this paper is the same as that in literature [13], which is shown by (2) and (3). Equation (2) is the Gaussian curve which will be used in the simulations to demonstrate the performance of the GRCM-PSO method; (3) is a 5-parameter curve which will be used in the experiments to show the practicability of our method: where is the column number of . is the combination of two Gaussian curves at the peak position with and for each side’s width and and for each side’s deviation from zero. The ranges of the , , are limited in order to guarantee that every peak has an integral shape. , in the experiments due to the profile of the data set. is the function to limit the amplitude at 1.
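To make the two kinds of reference curve concrete, the sketch below builds a Gaussian curve and a two-sided curve with five parameters. It is only one plausible reading of the description above: the function names, the argument names, and the exact way the offsets and the amplitude limit are combined are assumptions, since the formulas of (2) and (3) are not reproduced here.

```python
import numpy as np

def gaussian_curve(n, position, width):
    # Gaussian reference curve sampled at t = 1..n, with height 1 at `position`.
    t = np.arange(1, n + 1, dtype=float)
    return np.exp(-((t - position) ** 2) / (2.0 * width ** 2))

def two_sided_curve(n, position, w_left, w_right, d_left, d_right):
    # Assumed 5-parameter form: two half-Gaussians joined at `position`,
    # each side with its own width (w_*) and its own offset from zero (d_*).
    t = np.arange(1, n + 1, dtype=float)
    left = d_left + np.exp(-((t - position) ** 2) / (2.0 * w_left ** 2))
    right = d_right + np.exp(-((t - position) ** 2) / (2.0 * w_right ** 2))
    curve = np.where(t <= position, left, right)
    # Limit the amplitude at 1, as the text requires of the limiting function.
    return np.minimum(curve, 1.0)

rc_gauss = gaussian_curve(101, 51, 5.0)
rc_asym = two_sided_curve(101, 51, 5.0, 9.0, 0.05, 0.0)
```

With a wider right side than left side, `rc_asym` models a tailing peak, which a single symmetric Gaussian cannot represent.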

In order to obtain initial parameters with small errors, initialization is performed four times with the same population size of 2000, generating 8000 parameters in total; only the 2000 parameters with the smallest errors are kept as the initial parameters.
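The oversample-and-keep-the-best initialization can be sketched as follows. This is a hedged illustration: `error_fn` stands in for the GRCM error of Section 2.1, the bounds `lower` and `upper` are hypothetical, and uniform sampling is an assumption, since the text only states that the parameters are generated randomly.

```python
import numpy as np

def initialize(error_fn, lower, upper, pop_size=2000, rounds=4, seed=0):
    # Draw rounds * pop_size random parameters inside the box [lower, upper],
    # then keep the pop_size parameters with the smallest errors.
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    candidates = rng.uniform(lower, upper, size=(rounds * pop_size, lower.size))
    errors = np.array([error_fn(p) for p in candidates])
    keep = np.argsort(errors)[:pop_size]  # indices sorted by ascending error
    return candidates[keep], errors[keep]

# Toy usage with a stand-in error: squared distance to a known optimum.
optimum = np.array([50.0, 5.0])
params, errors = initialize(lambda p: float(np.sum((p - optimum) ** 2)),
                            lower=[0.0, 1.0], upper=[100.0, 20.0])
```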

2.1. The Model of GRCM

The function of the GRCM model is to assess the parameters by calculating their errors, which indicate the distance between the reference curves constructed by these parameters and the chromatogram peaks existing in . As shown in Figure 3, the GRCM model is composed of five elements: reference curve , data set , Reference Curve Measurement (RCM) model, predicted curve (PC) and measurement operator (MO) .

The RCM model is designed by introducing a vector so that

Equation (4) means that let approximate and look like . Then, we have the objective function as

Solving (5), we obtain the RCM model as where is a matrix generated from , which will be introduced in Appendix B as well as the deducing process from (5) to (6).

The MO is designed as

2.2. Search Category and Reinitialization

After the initialization, every parameter searches within a small hypersphere to find the parameter with the smallest error as its target . It is possible for to find as its target, that is, , and for to find as its target, that is, . In order to accelerate the search, we directly set . Finally, only a limited number of parameters are chosen as targets by the others; these are the rough solutions , (). It is assumed that all the real solutions lie near the rough solutions because the initial parameters are densely and randomly distributed. Therefore, reinitializing only around the rough solutions reduces the search area significantly. The areas for reinitialization are hyperspheres, each with a radius equal to half of the smallest distance between its centre rough solution and the other rough solutions, in order to cover all the possible spaces. An example of such a hypersphere is illustrated in Figure 4. There are five rough solutions , in the two-dimensional space. The distance between and other rough solutions is . So the hypersphere for is shown by the circle in Figure 4, where . The population in every hypersphere is set to 10.
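The target search and the reinitialization can be sketched as below. The function names and the `radius` argument are hypothetical, and uniform sampling inside each hypersphere is an assumption; the text only fixes the radius rule and the population of 10. The sketch also assumes at least two rough solutions, so that a smallest distance exists.

```python
import numpy as np

def rough_solutions(params, errors, radius):
    # Each parameter picks, within `radius`, the parameter with the smallest
    # error as its target (possibly itself).
    n = len(params)
    target = np.empty(n, dtype=int)
    for i in range(n):
        dist = np.linalg.norm(params - params[i], axis=1)
        near = np.flatnonzero(dist <= radius)
        target[i] = near[np.argmin(errors[near])]
    # Shortcut chains: if i targets j and j targets k, let i target k directly.
    for i in range(n):
        j = target[i]
        while target[j] != j:
            j = target[j]
        target[i] = j
    return np.unique(target)  # indices of the rough solutions

def reinitialize(rough, population=10, seed=0):
    # Sample `population` points in a hypersphere around each rough solution.
    rng = np.random.default_rng(seed)
    groups = []
    for k, centre in enumerate(rough):
        others = np.delete(rough, k, axis=0)
        # Radius: half of the smallest distance to any other rough solution.
        radius = 0.5 * np.min(np.linalg.norm(others - centre, axis=1))
        direction = rng.normal(size=(population, centre.size))
        direction /= np.linalg.norm(direction, axis=1, keepdims=True)
        r = radius * rng.random(population) ** (1.0 / centre.size)
        groups.append(centre + direction * r[:, None])
    return groups

params = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [5.0, 5.0], [5.1, 5.0]])
errors = np.array([3.0, 2.0, 1.0, 0.5, 4.0])
idx = rough_solutions(params, errors, radius=0.15)
groups = reinitialize(params[idx])
```

In this toy example the first parameter initially targets the second, which targets the third; the chain shortcut assigns the third parameter as the target of both.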

2.3. Algorithm of PSO

PSO is a swarm intelligence algorithm that emulates the social interaction and individual cognition of foraging bird flocks [15, 16]. Equation (8) gives the algorithm of PSO: where and represent the position and velocity of the th particle, respectively; is the inertia weight; and are acceleration constants; and are two random numbers in ; is the personal best position for ; and is the global best position. Please see the related literature for the values of the parameters in (8).

In this paper, the parameters are divided into several groups, each within its own hypersphere. Every group updates its particles according to (8) independently, until the value of the best particle in the group has not changed for 500 steps or the maximum number of steps is reached.
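A per-group PSO loop with this stopping rule can be sketched as follows, using the canonical velocity and position updates. The values of `w`, `c1`, `c2`, and `max_steps` are placeholder assumptions, not the settings used in this paper, and `error_fn` again stands in for the GRCM error.

```python
import numpy as np

def pso_group(error_fn, positions, w=0.7, c1=1.5, c2=1.5,
              patience=500, max_steps=5000, seed=0):
    # Canonical PSO for one group of particles.
    rng = np.random.default_rng(seed)
    x = np.array(positions, dtype=float)
    v = np.zeros_like(x)
    pbest = x.copy()
    pbest_err = np.array([error_fn(p) for p in x])
    g = int(np.argmin(pbest_err))
    gbest, gbest_err = pbest[g].copy(), pbest_err[g]
    stale = 0  # steps since the group's best value last improved
    for _ in range(max_steps):
        r1 = rng.random((len(x), 1))  # one random number per particle
        r2 = rng.random((len(x), 1))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = x + v
        err = np.array([error_fn(p) for p in x])
        improved = err < pbest_err
        pbest[improved] = x[improved]
        pbest_err[improved] = err[improved]
        g = int(np.argmin(pbest_err))
        if pbest_err[g] < gbest_err:
            gbest, gbest_err = pbest[g].copy(), pbest_err[g]
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break  # best value unchanged for `patience` steps
    return gbest, gbest_err

# Toy usage: minimize the squared distance to a known optimum.
optimum = np.array([3.0, -2.0])
toy_error = lambda p: float(np.sum((p - optimum) ** 2))
start = optimum + np.random.default_rng(1).uniform(-1.0, 1.0, size=(10, 2))
best, best_err = pso_group(toy_error, start, patience=50, max_steps=300)
```

Running one such loop per hypersphere keeps the groups independent, so a group that converges early does not wait for the others.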

2.4. Other Processes

During the process from , to , in Figure 2, the random initialization of may cause inaccuracy in the results. So this process is executed multiple times to reduce the influence of the random initialization. Through observation, ten executions were chosen, which generate 10 candidate solutions. The candidates may differ in the values, and even in the number, of their optimal parameters. First, the candidate with the maximum number of parameters is selected as a reference. Then, one parameter from every candidate is grouped with one parameter in the reference according to the Euclidean distance, and the number of parameters in every group is counted. Only the groups with more than 6 parameters are selected as valid groups. Finally, the parameter with the lowest error in every valid group is chosen to form the final result.
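The voting step over the ten candidates can be sketched as below. The grouping rule is one interpretation of the description: every parameter joins the group of its nearest reference parameter, and each candidate contributes at most one parameter per group. The function name and the data layout are hypothetical.

```python
import numpy as np

def merge_candidates(candidates, min_votes=7):
    # candidates: list of (params, errors) pairs, params of shape (k_i, d).
    # The reference is the candidate with the most parameters.
    ref_params = np.asarray(max(candidates, key=lambda c: len(c[0]))[0], dtype=float)
    groups = [[] for _ in ref_params]
    for params, errs in candidates:
        best = {}  # group index -> (distance, error, parameter)
        for p, e in zip(np.asarray(params, dtype=float), errs):
            dist = np.linalg.norm(ref_params - p, axis=1)
            g = int(np.argmin(dist))
            if g not in best or dist[g] < best[g][0]:
                best[g] = (dist[g], e, p)
        for g, (_, e, p) in best.items():
            groups[g].append((e, p))
    # "Larger than 6" members means at least 7; keep the lowest-error member.
    final = [min(members, key=lambda m: m[0])[1]
             for members in groups if len(members) >= min_votes]
    return np.array(final)

# Toy usage: ten runs agree on two peaks; two runs add a spurious third one.
runs = []
for k in range(10):
    p = [[10 + 0.01 * k, 2.0], [30 - 0.01 * k, 3.0]]
    e = [0.1 + 0.01 * k, 0.2 + 0.01 * k]
    if k < 2:
        p.append([60.0, 1.0])
        e.append(0.5)
    runs.append((np.array(p), np.array(e)))
result = merge_candidates(runs)
```

In this toy example the spurious parameter collects only two votes and is discarded, while the two consistent peaks are kept.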

Finally, the estimator is designed as the following equation to calculate the spectra for all the compounds: where is the approximation of chromatogram peaks; is the pseudoinverse function. Equation (9) is derived from (1) directly.
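A pseudoinverse estimator of this kind can be sketched as follows. The names and the orientation of the data matrix (rows as wavelength channels, columns as time points) are assumptions made for illustration; with the opposite orientation the pseudoinverse is applied on the other side.

```python
import numpy as np

def estimate_spectra(X, peaks):
    # X:     (n_channels, n_times) data set
    # peaks: (n_compounds, n_times) approximated chromatogram peaks
    # Least-squares solution of X = spectra^T @ peaks for the spectra.
    return (X @ np.linalg.pinv(peaks)).T

# Toy usage: recover two random spectra from a noise-free mixture.
t = np.arange(100)
peaks = np.vstack([np.exp(-0.5 * ((t - 40) / 5.0) ** 2),
                   np.exp(-0.5 * ((t - 50) / 6.0) ** 2)])
true_spectra = np.random.default_rng(0).random((2, 30))
X = true_spectra.T @ peaks
estimated = estimate_spectra(X, peaks)
```

When the peaks are linearly independent and the data are noise-free, the estimate matches the true spectra up to rounding; with noise it is the least-squares fit.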

3. Simulation, Experiment, and Discussion

In this section, a group of simulations is given to demonstrate the performance of the GRCM-PSO method. Then experiments on an HPLC-DAD data set are carried out to show the practicability of the GRCM-PSO method. Two criteria are used to evaluate the results: (1) whether all the chromatogram peaks can be found; (2) whether the errors between the true/simulated spectra and estimated spectra are small enough.

3.1. Simulation and Discussion

The simulation data set is shown in graphs (a) and (b) of Figure 5; it contains seven compounds with severe overlap. The seven chromatogram peaks are constructed by (2) with the parameters of , , , , , , and . The seven spectra are constructed randomly, subject to being uncorrelated with each other. Different levels of white noise are added to the simulation data set. The results are listed in Table 1. From the results, we can see the following.

(1) The GRCM-PSO method can separate the simulation data set without knowing the number of compounds in advance, even when severe overlap and white noise exist. This is a major advantage over previous methods that need to know the number of compounds in advance. The values of the and the error between the calculated spectra and the simulated spectra are small. The time cost of this method is much less than that of ICARCmPSO [13], which was 13.9 seconds. However, the MCR-ALS method cannot give correct results. The results given by the MCR-ALS method are illustrated in graphs (c) and (d) of Figure 5, in which no noise is added to the simulated data set.

(2) The average time cost of the ten implementations is almost the same across noise levels. This means that the level of the white noise has no significant influence on the time cost.

(3) The values of the and the error between the calculated spectra and the simulated spectra become larger as the noise level increases. It should be noted that when the noise becomes severe, the “Errors” for small peaks are influenced more significantly than those for big peaks.

3.2. Experiment and Discussion

The HPLC-DAD data set “adataset.mat” was downloaded free of charge from http://www.mcrals.info/. This data set is a three-compound mixture with two known pesticides and one unknown interferent [8]. The 5-parameter function shown by (3) is used in the experiments as the RC. The graphical results are illustrated in Figure 6, together with the results given by the ICARCmPSO method and the MCR-ALS method. The values of the results are listed in Table 2. From the experiments, we can see the following.

(1) Comparison between the GRCM-PSO method and the ICARCmPSO method [13]: the average time and average number of steps for the GRCM-PSO method are much smaller. The and the “Errors” are similar for the two methods.

For the ICARCmPSO method, the scope within which particles search for their local targets must be determined according to the specific application. For the GRCM-PSO method, this parameter is fixed to a small value. So, in terms of operability and speed, the GRCM-PSO method has an advantage over the ICARCmPSO method.

(2) Comparison between the GRCM-PSO method and the MCR-ALS method: both of them obtain the same number of compounds, 3. The MCR-ALS method is faster than the GRCM-PSO method, and its accuracy can also be better. However, the parameters of the GRCM-PSO method are easier to control.

For the MCR-ALS method, the two important parameters are the threshold to select the valid singular values and the noise level for initial estimation of the spectra. If noise exists, it will be difficult to decide the first parameter, as explained in Appendix A. The performance of the MCR-ALS method is also very sensitive to the second parameter, as shown in graph (i) of Figure 6. A small change of the , which is explained in Appendix A, will cause a large error in the calculated spectra, while all the parameters for the GRCM-PSO method are fixed for all applications.

So, in terms of operability and stability, the GRCM-PSO method has an advantage over the MCR-ALS method.

4. Conclusions and Future Works

A method named GRCM-PSO was proposed in this paper to separate the chromatogram peaks and spectra of compounds from the HPLC-DAD data set. The GRCM model transformed the separation problem into a multiparameter optimization problem. The PSO algorithm was introduced to calculate the optimal parameters. Groups of simulations with different noise levels were carried out on a simulated data set constructed with severe overlap among seven compounds. The GRCM-PSO method separated the chromatogram peaks and spectra from this simulated data set quickly and without knowing the number of the compounds in advance. Groups of experiments on a real HPLC-DAD data set were also carried out, and the results of the GRCM-PSO method, the ICARCmPSO method, and the MCR-ALS method were compared. The results showed that the GRCM-PSO method was an effective, efficient, and practical method to separate HPLC-DAD data sets even when severe overlap and white noise existed. The speed and practicability of the GRCM-PSO method are better than those of the ICARCmPSO method. The stability and operability of the GRCM-PSO method are better than those of the MCR-ALS method.

Currently, the performance of the GRCM-PSO method depends on the selection of the reference curve. So it is only suitable for separation tasks in which “a priori” knowledge is available, such as the separation of HPLC-DAD data sets. The accuracy of the results of the GRCM-PSO method can be improved by further research on more accurate reference curves.

Appendices

A. Parameters of MCR-ALS Method

The flowchart of the MCR-ALS method is illustrated in graph (a) of Figure 7. The number of the compounds is calculated by the SVD method. The SVD method needs a threshold to keep the valid singular values, which is noted as . The initial estimation of is calculated by the pure variable detection method, which needs to know the noise level of the data set, noted as . The variables and are two thresholds given according to the data set. Please see the related literature for detailed information on the MCR-ALS method [8].

The singular values, from largest to smallest, for the HPLC-DAD data set used in the experiment of this paper are listed in graph (b) of Figure 7. Although three compounds are known to be contained in the data set, there is no obvious boundary between the third singular value and the fourth one. That is to say, it is not easy to set the value for the parameter .
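The dependence on the singular-value threshold can be illustrated with a short sketch. The function name and the toy data are assumptions; the sketch only shows how the estimated number of compounds follows from counting the singular values above a chosen threshold, not the MCR-ALS implementation itself.

```python
import numpy as np

def count_components(X, threshold):
    # Singular values are returned in descending order.
    s = np.linalg.svd(X, compute_uv=False)
    return int(np.sum(s > threshold)), s

# Toy usage: a noise-free mixture of three compounds has exactly three
# singular values that are clearly above zero.
rng = np.random.default_rng(0)
t = np.arange(100)
peaks = np.vstack([np.exp(-0.5 * ((t - c) / 5.0) ** 2) for c in (35, 50, 65)])
X_clean = rng.random((30, 3)) @ peaks
n_clean, s_clean = count_components(X_clean, threshold=1e-8)

# With added noise, the remaining singular values are no longer negligible,
# so the count depends on where the threshold is placed.
X_noisy = X_clean + 0.05 * rng.standard_normal(X_clean.shape)
n_noisy, s_noisy = count_components(X_noisy, threshold=1e-8)
```

In the noisy case every singular value exceeds the tiny threshold, so the count no longer reflects the number of compounds; a threshold must then be chosen by inspecting the singular values, which is the difficulty described above.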

As shown in graph (i) of Figure 6, a small change of the parameter will lead to a large error in the calculated spectrum.

B. Deducing Process of RCM Model

In order to make the calculation for (5) simpler, a preprocessing [17] is used to transform as where is a matrix generated in the preprocessing; is a matrix in which every row is filled with the average of every row of (see [17] for details). Every column vector in (B.1) satisfies where is the row number of the matrix . Then, (6) can be transformed as where is the number of the columns of ; and are the matrices generated in the preprocessing. If we set where is a vector with the same value for every element referring to a specific and (B.3) is transformed as The proof of will be given in Appendix C. Equation (B.5) is an optimization problem. According to the Karush-Kuhn-Tucker (KKT) conditions [18], the solution should satisfy where is the value of th element under parameter . The Jacobian matrix of (B.6) is Therefore, the following formula is obtained based on Newton iteration [19]: With (B.8), we can calculate as

C. Proof for (B.5)

From the definition in (B.5), we have where is the column number of the matrix and () are the column vectors in . Then, we have From (B.2), we have Because the transformation from to does not change the original amplitude, we have where is the column vector with zero mean [16]. Substituting (C.3) and (C.4) into (C.2), we have

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

Lizhi Cui thanks the School of Information Technologies, the University of Sydney, for providing him with a Ph.D. fellowship; thanks are also due to the Chinese Scholarship Council for providing Lizhi Cui with financial support; the student’s no. is 201206740061.