Journal of Analytical Methods in Chemistry

Volume 2018, Article ID 9031356, 11 pages

https://doi.org/10.1155/2018/9031356

## Collaborative Penalized Least Squares for Background Correction of Multiple Raman Spectra

^{1}Faculty of Science and Technology, University of Macau, E11 Avenida da Universidade, Taipa, Macau

^{2}Chemistry and Chemical Engineering, College of Biology, Hunan University, Changsha 410082, China

Correspondence should be addressed to Long Chen; longchen@umac.mo

Received 21 April 2018; Revised 8 July 2018; Accepted 26 July 2018; Published 29 August 2018

Academic Editor: Małgorzata Jakubowska

Copyright © 2018 Long Chen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Although Raman spectroscopy has been widely used as a noninvasive analytical tool in various applications, backgrounds in Raman spectra impair its performance in quantitative analysis. Many algorithms have been proposed to correct the background one spectrum at a time. However, in real applications, multiple spectra are commonly collected from nearby locations on a sample or from the same analyte at different concentrations. These spectra are strongly correlated and provide valuable information for more robust background correction. Herein, we propose two new strategies to remove the background for a set of related spectra collaboratively. Based on weighted penalized least squares, the new approaches use the fused weights from multiple spectra or the weights from the average spectrum to estimate the background of each spectrum in the set. Background correction results from both simulated and real experimental data demonstrate that the proposed collaborative approaches outperform traditional algorithms, which process spectra individually.

#### 1. Introduction

Raman spectroscopy, which provides valuable chemical and physical information about the studied samples, is widely used as an analytical tool for many applications such as material identification, chemical detection, and biomedical analysis [1–3]. Peaks in Raman spectra are the fingerprints of the analyte, and the corresponding peak heights or peak areas correlate strongly with the concentration of the analyte. However, spectral interferences have a strong negative effect on the measurement of peaks, which in turn hinders the performance of Raman spectroscopy-based quantitative analysis [4]. Representative interferences for Raman spectra include backgrounds mainly caused by instrument fluctuations and fluorescent substances. Instrument noise and the occasional spikes caused by cosmic rays also degrade the quality of Raman spectra. As a result, some preprocessing steps should be conducted to handle the interferences in the Raman spectra. In this paper, we mainly focus on the background correction problem.

Numerous background correction algorithms have been proposed in the past decades for Raman and other spectra. For example, the wavelet transform is used as a powerful tool for background removal by decomposing the Raman signals in the frequency domain [5–7]. Because the performance of wavelet-based approaches is greatly affected by the selection of base wavelets and scales, the adaptive wavelet transform was used in [8] to obtain a multiresolution decomposition of a Raman spectrum. The low-frequency background and high-frequency noise were removed thereafter.

Iterative smoothing algorithms also play an important role in background estimation because the background is usually characterized by its smooth variation. The general procedure of this approach is to smooth the spectrum repeatedly until the background is obtained. Many well-known smoothing filters and their enhancements have been widely used in previous studies for the purpose of iteratively removing peaks and deriving backgrounds in the spectra [9–12]. The drawback of such smoothing algorithms is the difficulty in automating their iterations, although some efforts have been made on this issue [10, 11].

Curve fitting fits the background to appropriately selected points in the spectrum using a fidelity or loss function such as least squares [13–15]. This selection-then-approximation approach is very similar to the manual background estimation procedure. Its simple implementation and short running time have made curve fitting a popular background correction method in real applications. Different curves such as Bezier curves [16], splines [13], and polynomial functions [14, 17] have been used to fit the background. In contrast, without specifying the curve shape of the background, the penalized least squares- (PLS-) based algorithms attempt to automatically estimate the background by a direct approximation of the spectrum with a penalization on the roughness of the approximated curve [15, 18]. In the PLS-based algorithms, one critical issue is the weight setting for different points in the spectrum. These weights indicate the contribution of the corresponding points to the final curve construction. Many methods have been proposed to this end [19, 20], and some automatic setting techniques have also been suggested [21, 22].

Some comparisons of different baseline correction approaches have been conducted [23], and the optimal choice of background removal for the statistical analysis of spectra has been explored [24]. However, so far, there is no single automatic method that handles all spectra well and can be regarded as the best. Recently, newer baseline correction algorithms such as those based on sparse representation [25] and neural networks [26] have been proposed.

In practical applications of Raman spectroscopy, multiple measurements of a given analyte are normal practice. A set of strongly correlated spectra is then obtained, although they may be generated under different environmental conditions and sampling protocols. In quantitative Raman analysis such as mixture analysis and multivariate calibration, besides multiple measurements, multiple spectra derived from the same mixture with different analyte concentrations are also correlated. Due to the varying backgrounds and random noise in the set of related spectra, information that is clear in one spectrum, such as the peak locations, may not be significant in another spectrum. How do we collaboratively use the valuable information in a set of related spectra for the purpose of spectrum preprocessing? Foist et al. first noticed this problem and proposed a method to denoise multidimensional spectral data collaboratively [27]. For background correction, a few approaches have been proposed that utilize the common characteristics shared in a set of related spectra [6, 28, 29]. For instance, the multiple spectra baseline correction (MSBC) algorithm designed in [28] assumed that the pairwise differences between the background-removed spectra are small and inserted a regularization term for this prior into the asymmetric least squares.

In this paper, based on PLS, we propose a new approach focusing on collaborative background correction for a set of related spectra. Specifically, our main contribution is to design two ensemble strategies to embed the weight information of PLS from multiple spectra to boost each spectrum’s background correction. For PLS, the weight of each point in a spectrum denotes the contribution of the point to the final background estimation. In the first scheme, we directly use the weights derived from the average spectrum (by averaging all the related spectra) to calculate the background of each spectrum. The second scheme applies the average of weights obtained from each spectrum using traditional PLS-based approaches. By using the two schemes, our new approach utilizes the strong correlations among multiple spectra and suppresses the effect of noise and signal variation in different spectra.

To illustrate the advantage of collaboratively calculating the weights for PLS algorithms when handling several related spectra, we combine the adaptive iteratively reweighted penalized least squares algorithm (airPLS) [19] and the morphological weighted penalized least squares algorithm (MPLS) [20] with the proposed weight ensemble strategies. In the experiments on the synthetic and real Raman spectra, these enhanced PLS approaches show accurate and robust background removal capability.

#### 2. Theory

##### 2.1. PLS for Background Correction

The signal smoothing problem was first proposed by Whittaker in 1922 [30]. The pioneering works on applying PLS for baseline correction were conducted by Eilers et al. more than 10 years ago [31, 32]. The rationale behind PLS is to approximate the observed data by balancing the conflicts between the fidelity to original data and the roughness of fitting data.

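Assume that $\mathbf{x}$ is a vector of the Raman spectrum and $\mathbf{z}$ is the fitting vector; both of them have a length of $m$ elements. The fitted $\mathbf{z}$ should balance the fidelity to $\mathbf{x}$ against the roughness of the fitted vector. $F$ denotes the fidelity to the Raman spectrum $\mathbf{x}$, which can be expressed as the sum of squares of differences between $\mathbf{x}$ and $\mathbf{z}$:

$$F = \sum_{i=1}^{m} (x_i - z_i)^2. \tag{1}$$

$R$ denotes the roughness of the fitting vector $\mathbf{z}$, which can be expressed as the sum of squares of differences between each element of $\mathbf{z}$ and its neighbors:

$$R = \sum_{i=2}^{m} (z_i - z_{i-1})^2, \tag{2}$$

where the square of first differences penalty is adopted in (2) to simplify the presentation. In other cases, it is also natural to quantify the roughness by the square of higher-order differences.

The following equation is adopted to measure the balanced combination of fidelity and roughness:

$$Q = F + \lambda R, \tag{3}$$

where $\lambda$ is a user-adjustable parameter that balances the fidelity and roughness. A larger $\lambda$ favours a smoother fitted vector.

In order to apply the PLS to estimate the background, a weight vector $\mathbf{w}$ was introduced for fidelity; its element $w_i$ can be regarded as a weight that depicts the reliability of point $i$ as a part of the background. Then, $F$ is changed to

$$F = \sum_{i=1}^{m} w_i (x_i - z_i)^2. \tag{4}$$

To solve the minimization problem of (3), we get a linear system by equating the partial derivatives of $Q$ to zero ($\partial Q / \partial \mathbf{z} = \mathbf{0}$), and the matrix form of the obtained linear system is as follows:

$$(\mathbf{W} + \lambda \mathbf{D}^{\mathsf{T}} \mathbf{D})\,\mathbf{z} = \mathbf{W}\mathbf{x}, \tag{5}$$

where $\mathbf{W}$ is a diagonal matrix with $w_i$ on its diagonal and $\mathbf{D}$ is the derivative of an identity matrix, that is, the difference matrix for which $\mathbf{D}\mathbf{z}$ contains the differences of neighboring elements of $\mathbf{z}$. Finally, we solve the fitting vector as follows:

$$\mathbf{z} = (\mathbf{W} + \lambda \mathbf{D}^{\mathsf{T}} \mathbf{D})^{-1}\,\mathbf{W}\mathbf{x}. \tag{6}$$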
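As a minimal sketch of how the weighted PLS solution in (5) and (6) could be evaluated numerically, the Python snippet below builds a sparse difference matrix and solves the linear system directly. The function name `weighted_pls`, its parameter names, and the default values of `lam` and `order` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve


def weighted_pls(x, w, lam=1e4, order=1):
    """Return the fitted vector z that solves (W + lam * D^T D) z = W x.

    x     : 1-D spectrum
    w     : 1-D weight vector (at least one weight must be positive)
    lam   : roughness penalty, larger values give a smoother z (assumed default)
    order : order of the difference penalty (1 matches the first-difference case)
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    m = x.size
    # D is obtained by differencing the rows of an identity matrix.
    D = sparse.eye(m, format="csr")
    for _ in range(order):
        D = D[1:] - D[:-1]
    W = sparse.diags(w, 0)
    A = (W + lam * (D.T @ D)).tocsc()
    # Solving the sparse system avoids forming the explicit inverse.
    return spsolve(A, w * x)
```

With binary weights that are zero over a peak and one elsewhere, the fitted vector follows the remaining points and bridges the peak region smoothly, which is the behavior that makes the weighted form usable as a background estimate.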

Several methods have been proposed for the weight calculation in PLS. To control the smoothness of the fitted vector iteratively, the airPLS method [19] calculates the weight vector $\mathbf{w}$ in an adaptive way. The weight vector $\mathbf{w}^{t}$ in each iteration $t$ is obtained as follows:

$$w_i^{t} = \begin{cases} 0, & x_i \ge z_i^{t-1}, \\ e^{\,t\,(x_i - z_i^{t-1}) / |\mathbf{d}^{t}|}, & x_i < z_i^{t-1}. \end{cases} \tag{7}$$

The vector $\mathbf{d}^{t}$ consists of the negative elements obtained from the subtraction between $\mathbf{x}$ and $\mathbf{z}^{t-1}$ in the iteration step $t$. The fitted vector $\mathbf{z}^{t-1}$ in the previous iteration step is a candidate of the baseline. If the value of the signal is greater than the candidate, it can be seen as a part of the peak, of which the weight is set to zero. If not, the weight is calculated as in (7). When the iteration count reaches the maximum or when the following termination criterion is satisfied, the iteration will stop and the final weight vector is used for PLS to generate the background:

$$|\mathbf{d}^{t}| < 0.001 \times |\mathbf{x}|. \tag{8}$$
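The adaptive reweighting loop described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the reference airPLS implementation: the function name `airpls`, the defaults for `lam` and `max_iter`, and the use of the sum of absolute values as the vector norm are assumptions, and the special treatment of the end points used in some airPLS implementations is omitted.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve


def airpls(x, lam=1e4, max_iter=15, tol=1e-3):
    """Return (w, z): the final weight vector and the background fitted with it."""
    x = np.asarray(x, dtype=float)
    m = x.size
    E = sparse.eye(m, format="csr")
    D = E[1:] - E[:-1]                   # first-difference matrix
    P = lam * (D.T @ D)                  # roughness penalty term
    w = np.ones(m)                       # start with equal trust in all points
    for t in range(1, max_iter + 1):
        A = (sparse.diags(w, 0) + P).tocsc()
        z = spsolve(A, w * x)            # weighted PLS fit with current weights
        d = x - z
        neg = d < 0                      # points lying below the fitted curve
        d_norm = np.abs(d[neg]).sum()
        # Stop when nothing lies below the fit, the residual is small
        # relative to the signal, or the iteration limit is reached.
        if not neg.any() or d_norm < tol * np.abs(x).sum() or t == max_iter:
            break
        w = np.zeros(m)                  # points above the fit: treated as peaks
        w[neg] = np.exp(t * d[neg] / d_norm)
    return w, z
```

The function returns the weight vector that produced the final fit, so the same weights can be reused in a separate weighted PLS solve.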

Unlike airPLS, which adjusts the weights adaptively, the MPLS method [20] calculates the weight vector directly by first applying mathematical morphology operations to the spectrum to remove the peaks and generate a rough background. The morphology operation acts on an object spectrum with a flat (plane) structuring element. The transformation is an opening operation, which consists of an erosion followed by a dilation. To refine the background, the local minimum points between peak areas are selected as meaningful background points with weight 1, and the remaining points are given a weight of 0. The weighted PLS is then applied to get the final background.

##### 2.2. Collaborative Weighted Penalized Least Squares for Multiple Spectra

As discussed in the Introduction, in practical applications of Raman spectroscopy, we may collect strongly related spectra that come either from the same kind of material or from solutions with different concentrations. In these cases, we can collaboratively estimate each spectrum’s background by considering the valuable information shared in the whole set of spectra. Specifically, for the weighted PLS-based approaches, we design two schemes to utilize the global information in the set of spectra for the weight calculation.

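The simplest information fusion approach for a set of highly related spectra is to average them. More formally, given a set of spectra $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$, we calculate the average spectrum $\bar{\mathbf{x}}$ as follows:

$$\bar{\mathbf{x}} = \frac{1}{n} \sum_{j=1}^{n} \mathbf{x}_j. \tag{9}$$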

By averaging, the effect of noise on spectra is suppressed, and the average spectrum can be regarded as an informative representation of the set of related spectra. With the average spectrum, we can apply some traditional weighted PLS-based approaches to calculate the weights of different points, which denote their reliabilities as parts of the background. Then, the obtained weight vector is used for each single spectrum’s background removal. For a more accurate fusion, we may set weights for different spectra in the summation of (9). For example, a high-quality spectrum with a higher signal-to-noise ratio may take a higher weight to contribute more to the final average spectrum. In our experiments, however, we find that the simple summation produces good results as well.

The second scheme to fuse the information from multiple spectra is to ensemble the weight vectors of all the spectra. For the spectrum $\mathbf{x}_j$, we first use some traditional weighted PLS-based approach to calculate the weight vector $\mathbf{w}_j$ for it. Then, the weight vectors for all the related spectra are combined into one as follows:

$$\mathbf{w}_{\mathrm{CW}} = \frac{1}{n} \sum_{j=1}^{n} \mathbf{w}_j, \tag{10}$$

where the combined weight vector $\mathbf{w}_{\mathrm{CW}}$ will be used as the final weight vector in weighted PLS-based background estimation of each spectrum.
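The two fusion schemes can be illustrated with a short Python sketch in which the weight rule is passed in as a function. Everything named here is hypothetical: `toy_weights` is a deliberately simple placeholder (it is neither airPLS nor MPLS) used only to keep the example self-contained, and `average_spectrum_weights`, `combined_weights`, and `correct_all` are illustrative names. In practice, a weight rule from an existing weighted PLS method would be supplied instead.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve


def weighted_pls(x, w, lam=1e4):
    """Background estimate from (6) with a first-difference penalty."""
    m = x.size
    E = sparse.eye(m, format="csr")
    D = E[1:] - E[:-1]
    A = (sparse.diags(w, 0) + lam * (D.T @ D)).tocsc()
    return spsolve(A, w * x)


def toy_weights(x):
    """Placeholder weight rule (NOT airPLS or MPLS): trust points at or below the median."""
    return (x <= np.median(x)).astype(float)


def average_spectrum_weights(spectra, weight_fn):
    """Scheme 1: weights computed once, from the average spectrum of (9)."""
    return weight_fn(spectra.mean(axis=0))


def combined_weights(spectra, weight_fn):
    """Scheme 2: per-spectrum weights averaged as in (10)."""
    return np.mean([weight_fn(x) for x in spectra], axis=0)


def correct_all(spectra, w, lam=1e4):
    """Apply one shared weight vector to every spectrum in the set."""
    backgrounds = np.array([weighted_pls(x, w, lam) for x in spectra])
    return spectra - backgrounds, backgrounds
```

Here `spectra` is assumed to be a two-dimensional array with one spectrum per row. Both schemes return a single weight vector, so the only difference between them is where the averaging happens: on the spectra before the weight rule is applied, or on the weights afterwards.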

To illustrate the application of the two schemes proposed above, we improved the airPLS and MPLS methods by modifying their final weight vectors. In these two PLS-based background estimation methods, whether the weight vector for points in the spectrum is adaptively adjusted in the iteration (the airPLS case) or directly obtained from morphology operations (the MPLS case), the final step to estimate the background is a linear regression step that solves (5) by using (6) with the determined weights.

Given a set of spectra $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$, we have the following four enhanced PLS approaches to estimate each spectrum’s background.

###### 2.2.1. Average Spectrum-Based airPLS (AS airPLS)

In this method, we first derive the average spectrum $\bar{\mathbf{x}}$ from (9). Then, airPLS is applied to $\bar{\mathbf{x}}$. Here, however, we only record the weight vector derived at the last iteration of airPLS using (7). This weight vector is denoted as $\mathbf{w}_{\mathrm{AS}}$. The background of each spectrum $\mathbf{x}_j$ is then calculated by using (6), in which the weight vector is $\mathbf{w}_{\mathrm{AS}}$ and the spectrum is $\mathbf{x}_j$.

###### 2.2.2. Combined Weight-Based airPLS (CW airPLS)

This method first applies airPLS for each spectrum. We only record the final weight vector for each spectrum obtained in the last iteration of airPLS. Then, the combined weight vector is calculated by using (10), and we use this weight vector to derive each spectrum’s background by using (6).

###### 2.2.3. Average Spectrum-Based MPLS (AS MPLS)

This method is very similar to AS airPLS. The only difference is that the weight vector $\mathbf{w}_{\mathrm{AS}}$ is calculated by applying MPLS to the average spectrum $\bar{\mathbf{x}}$.

###### 2.2.4. Combined Weight-Based MPLS (CW MPLS)

This method is similar to CW airPLS. The difference is that the initial weight vector for each spectrum is obtained by MPLS instead of airPLS.

Because the weight vector is also extensively used in other weighted PLS-based background removal approaches [33–35], including some recently proposed enhancements of airPLS [18], the average spectrum and combined weight-based schemes proposed in this paper can be easily adopted by these approaches to collaboratively process a set of related spectra.

#### 3. Experimental

To verify the performance of collaborative approaches proposed in this paper on background removal of a set of related Raman spectra, we compare them with traditional approaches on simulated spectra and real Raman spectra.

##### 3.1. Simulated Data

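For the simulated data, each spectrum $\mathbf{x}$ in the set consists of the pure signal $\mathbf{s}$, background $\mathbf{b}$, and noise $\mathbf{e}$:

$$\mathbf{x} = \mathbf{s} + \mathbf{b} + \mathbf{e}. \tag{11}$$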

Because our methods process multiple correlated spectra, the simulated pure signal is a mixture of three pure spectra, illustrated in Figure 1(a), that represent three chemical components. The concentrations of the different components in the mixture are randomly drawn from 0% to 100%. As a result, the simulated pure signals show variations in the peaks. Figure 1(b) depicts some simulated pure signals with different concentrations of components. In addition, the background in (11) is generated by exponential, polynomial, sigmoid, or sine curves with a random amplitude. Finally, Gaussian white noise is added to the sum of the background and pure signal. Altogether, we randomly generate 30 simulated spectra for each type of background, as shown in Figures 1(c)–1(f). Their corresponding backgrounds are also plotted.
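A simulation of this kind could be sketched in Python as below. All numerical choices are assumptions made for illustration only: the peak positions, widths, and heights of the three pure spectra, the exact background curve formulas, the amplitude range, the noise level, the independent uniform draws for the concentrations, and the function names `gaussian` and `simulate_spectra` are hypothetical and do not reproduce the data in Figure 1.

```python
import numpy as np


def gaussian(axis, center, width, height):
    """Single Gaussian peak used as a stand-in for a Raman band."""
    return height * np.exp(-0.5 * ((axis - center) / width) ** 2)


def simulate_spectra(n_spectra=30, m=1000, kind="exponential", noise_sd=0.02, seed=0):
    """Return (x, s, b, e) with x = s + b + e, one spectrum per row."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, m)  # normalized spectral axis (assumption)

    # Three hypothetical pure component spectra.
    pure = np.stack([
        gaussian(t, 0.20, 0.010, 1.0) + gaussian(t, 0.55, 0.015, 0.6),
        gaussian(t, 0.35, 0.012, 0.8) + gaussian(t, 0.80, 0.010, 0.5),
        gaussian(t, 0.45, 0.008, 0.9) + gaussian(t, 0.65, 0.020, 0.4),
    ])

    # Hypothetical background shapes, one per background type.
    shapes = {
        "exponential": np.exp(-3.0 * t),
        "polynomial": 1.0 - 2.0 * t + 1.5 * t ** 2,
        "sigmoid": 1.0 / (1.0 + np.exp(-10.0 * (t - 0.5))),
        "sine": 0.5 + 0.5 * np.sin(2.0 * np.pi * t),
    }

    # Concentrations drawn independently from 0 to 1 (0% to 100%).
    conc = rng.uniform(0.0, 1.0, size=(n_spectra, 3))
    s = conc @ pure                                    # pure mixture signals
    amp = rng.uniform(0.5, 2.0, size=(n_spectra, 1))   # random background amplitude
    b = amp * shapes[kind]                             # backgrounds
    e = rng.normal(0.0, noise_sd, size=(n_spectra, m)) # Gaussian white noise
    return s + b + e, s, b, e
```

Returning the components separately keeps the true background available, so an estimated background can be compared against `b` directly.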