Abstract

The Real Time Analyzer (RTA) utilizing DC- and AC-voltammetric techniques is an in situ, online monitoring system that provides a complete chemical analysis of different electrochemical deposition solutions. The RTA employs multivariate calibration when predicting concentration parameters from a multivariate data set. Although the hierarchical and multiblock Principal Component Regression- (PCR-) and Partial Least Squares- (PLS-) based methods can handle data sets even when the number of variables significantly exceeds the number of samples, it can be advantageous to reduce the number of variables to obtain improvement of the model predictions and better interpretation. This presentation focuses on the introduction of a multistep, rigorous method of data-selection-based Least Squares Regression, Simple Modeling of Class Analogy modeling power, and, as a novel application in electroanalysis, Uninformative Variable Elimination by PLS and by PCR, Variable Importance in the Projection coupled with PLS, Interval PLS, Interval PCR, and Moving Window PLS. Selection criteria of the optimum decomposition technique for the specific data are also demonstrated. The chief goal of this paper is to introduce to the community of electroanalytical chemists numerous variable selection methods which are well established in spectroscopy and can be successfully applied to voltammetric data analysis.

1. Introduction

Electrochemical deposition of copper from an acidic bath is a commonly used method for on-chip production of interconnects for microelectronics [1]. The strict requirements of chip manufacturing demand rapid and void-free filling of high-aspect-ratio topologies of sub-100 nm features (as small as 18 nm, currently). Rapid superconformal filling is achieved by using multicomponent plating solutions that contain inorganic and organic constituents of various chemical characteristics present at concentrations which differ by several orders of magnitude. The narrow tolerances allowed for the final product require accurate and reliable control of the plating solution with similar or even narrower tolerances.

Accurate and prompt concentration monitoring and control of multicomponent electroplating baths are indispensable in satisfying process specifications for the manufacturing of electronic components while minimizing production costs. The Real Time Analyzer, utilizing numerous voltammetric techniques, is an in situ, online monitoring system that provides a complete chemical analysis of electrometallization solutions. The fully computerized instrumentation requires no specially trained chemical operators and practically eliminates the need for a chemical analytical laboratory.

Typically used organic additives include (i) suppressors (like polyethers) that inhibit the rate of copper deposition at the tops of the trenches and vias by increasing the copper ion reduction overpotential through surface adsorption in interaction with chloride ion [2, 3]; (ii) accelerators, disulfide molecules, which facilitate the reduction process of copper, most probably by anion-induced adsorption of copper complex(es) at the metallic surface [4–6]; and (iii) levelers, high molecular weight amines or amides, that act as secondary suppressors to control the grain size of the deposited copper and inhibit overplating [7]. In combination, these additives can achieve accelerated, bottom-up electrodeposition of copper in submicron inlaid features, which permits void-free interconnect wiring in damascene structures. This paper presents a description of RTA data analysis using the example of a suppressor. Although the presentation deals with suppressor analysis, the approach presented is general and is utilized by the RTA for determining the concentration of all plating bath components.

The organic additives undergo significant changes of concentration during bath usage [4, 8, 9]. Therefore, their concentrations should be closely and accurately monitored and replenished in order to maintain their concentrations at the level corresponding to the optimum plating performance.

Ni and Kokot [10] explored the question of whether chemometric methods enhance the performance of electroanalytical methods and provided evidence for a strongly affirmative answer. Despite numerous indisputable benefits, chemometrics has not been applied to electroanalysis nearly as widely as to spectroscopy or chromatography.

Wikiel et al. [11] and Jaworski et al. [12–16] extensively studied the application of soft modeling techniques for the development of analytical models aimed at accurately and robustly predicting the concentrations of deliberately added bath constituents. Also, Wikiel et al. [17] and Jaworski et al. [18] employed various soft modeling techniques for determinant analyses implemented online as early fault detection routines for industrial electroplating solutions.

Specific waveforms are developed to produce voltammograms having regions which show linear dependence of the current on the concentration of the analyte of interest, while being practically immune to varying concentrations of all other bath constituents. Sometimes, the voltammograms recorded for a single waveform contain several portions (ranges of points of the voltammogram) meeting this objective. For such a waveform, the data can be divided into meaningful blocks in order to improve the interpretability. The ability to build a multivariate analytical model utilizing the information contained in each of the ranges (blocks) of the voltammogram leads to increased diversity within the data as compared to single-range-based data sets. This diversity results in greater robustness of the calibration model calculated from that data.

Hierarchical Principal Component Regression (HPCR) [12, 19], Hierarchical Partial Least Squares (HPLS) [12, 19], Consensus PCR (CPCR) [12, 19], and Multiblock PLS (MBPLS) [12, 19] methods provide tools for handling variables arranged into meaningful blocks, which can be decomposed and subsequently (HPCR, CPCR) or simultaneously (HPLS, MBPLS) regressed against the dependent variables (concentrations).

The chief goal of this presentation is to show that the variable selection methods whose applications are well established in spectroscopy can also be transferred to electroanalytical data. Specifically, this objective is achieved by an introduction of a rigorous, multistep procedure for selecting the blocks of the voltammogram to be subsequently used for analytical model developments and a choice of proper data decomposition technique in order to compress the multivariate voltammetric data and reasonably extract information.

Wikiel et al. [11] and Jaworski et al. [12] introduced a technique coupling Simple Modeling of Class Analogy- (SIMCA-) based modeling power [20] and Least Squares Regression (LSR) for the selection of portions of voltammograms for further factor and regression analyses. The aim of this paper is to extend the criteria for determination of optimum ranges of voltammograms by a novel application of techniques such as Uninformative Variable Elimination- (UVE-) PLS [21, 22], UVE-PCR [23], and PLS-Variable Importance in the Projection (VIP) [24, 25] along with Interval PLS (IPLS) [26], Interval PCR (IPCR), and Moving Window PLS (MWPLS) [27] in electroanalysis. These variable selection techniques have been used in spectroscopy (predominantly near infrared and infrared [22, 26, 27]), chromatography, and mass spectrometry, but with the exception of IPLS [28] they have not been applied to electroanalysis. In recent years, the use of variable selection methods in trace analysis of metals by anodic stripping voltammetry (ASV) has surfaced sporadically [28, 29], sometimes being combined with Multivariate Linear Regression on aligned ASV peaks rather than chemometric data compression techniques [29]. The marginal utilization of variable selection techniques in electroanalysis has persisted despite promising results obtained in 1999 [30] by applying a genetic algorithm as a variable selection method in the multivariate analysis with PLS of several polarographic and stripping voltammetric data sets, where different interferences were present.

Modern chemometrics is a mature scientific discipline presenting the researcher with a vast number of powerful data decomposition techniques. Although some techniques appear more advanced than the others because of their mathematical complexity, the most suitable methods should be properly chosen depending on the kind of data to be analyzed in order to develop a sound and robust analytical model.

2. Experimental

Voltammetric experiments were performed utilizing the Real Time Analyzer (Technic, Inc., Cranston, USA), a fully computer-controlled electroanalytical system. Measurements were conducted inside a compact flow-through electrochemical cell (electrode compartment of the Multi-Task Electrochemical Probe (MTEP)) submerged in the temperature-controlled (25 ± 0.2°C) plating solution. The volume of the inner cell compartment was about 20 mL. The solution to be analyzed was circulated inside the probe using a software-controlled diaphragm pump (KNF Neuberger, Balterswil, Switzerland).

The electrochemical cell was a classical three-electrode system with a working electrode made of platinum (Johnson-Matthey) wire (1 mm diameter; 10 mm length), an auxiliary platinum (Johnson-Matthey) foil electrode forming a cylinder around the working electrode, and an in situ generated reference electrode made of copper metal deposited onto a platinum (Johnson-Matthey) wire, immediately before the measurement.

All inorganic chemicals were of Analytical Grade (J.T. Baker, Phillipsburg, NJ). A proprietary organic additive system (Enthone, West Haven, CT) was used in this study.

All calculations were performed in the MATLAB R2012b (The MathWorks, Inc., Natick, MA) environment. The PLS and scaling routines were taken from the PLS Toolbox 6.5.4 (Eigenvector Research, Inc., Manson, WA, http://www.eigenvector.com). The MWPLS routine was taken from iToolbox (University of Copenhagen, Denmark, http://www.models.life.ku.dk/iToolbox). All other procedures were written by the authors.

3. Results and Discussion

3.1. Composition of Training and Validation Sets

The subject of this investigation was the ViaForm® copper plating bath, which consisted of six components: copper (II) ion (from copper sulfate), sulfuric acid, chloride ion, suppressor, accelerator, and leveler, present at the target concentrations of 0.785 M, 0.820 M, 1.50 mM, 7.00 mL L−1, 7.00 mL L−1, and 0.76 mL L−1, respectively.

The specific objective was to create a calibration model for suppressor in the presence of varying concentrations of accelerator and leveler. The concentrations of suppressor, accelerator, and leveler were varied linearly on five levels within the ranges of 5.00 to 9.00 mL L−1, 5.00 to 9.00 mL L−1, and 0.38 to 1.13 mL L−1, respectively. The training set was composed as a 5-level-3-component linear orthogonal array exploring 25 uniformly distributed combinations of suppressor, accelerator, and leveler concentrations. Additionally, the training set was augmented by the voltammetric data recorded for three standard solutions containing suppressor, accelerator, and leveler concentrations at the low, target, and high limits. The inorganic bath components, copper (II) ion, sulfuric acid, and chloride ion, were held constant at their target levels. The voltammetric data were recorded for each of the 28 solutions in triplicate, resulting in 84 samples. The composition of the entire training set is presented in Table 1.

In order to assess the predictive abilities, the calibration model was externally validated on the validation set. The external validation set consisted of 27 data sets obtained for 9 solutions containing suppressor, accelerator, and leveler varied linearly on three levels within the ranges of 5.50 to 8.50 mL L−1, 5.50 to 8.50 mL L−1, and 0.47 to 1.04 mL L−1, respectively. The validation set was a 3-level-3-component linear orthogonal array exploring uniformly distributed 9 combinations of suppressor, accelerator, and leveler concentrations. The composition of the validation set is presented in Table 2.

Waveform design is a preliminary step of the plating bath analysis utilizing voltammetry [11]. This procedure aims to obtain waveforms which are bath specific and are designed to produce a current response that changes linearly with the concentration of the analyte. It is important in waveform development to minimize the influence of the concentration changes of components other than the analyte on the current response of the voltammogram of the particular analyte.

In some cases, it is possible to obtain a waveform producing voltammograms whose several portions can be utilized for calibration calculation. Figure 1 presents a waveform whose current response linearly varies only with the changing concentration of suppressor.

The voltammetric data throughout this paper are predominantly considered numerical input for chemometric treatment aiming to select the specific sections of the voltammograms which are most informative for subsequent calibration calculation. Therefore, the abscissa of the voltammograms in this paper is the index point of the voltammogram rather than the applied potential, as is usually used. The values of the index points of the voltammogram are transformed values of the applied potential. The index points of voltammograms are a linear function of the voltammetric current sampling time. For the voltammogram of Figure 1, the current is sampled at the end of each step, which has a duration of 1 ms. For instance, the 3350th point of the voltammogram is sampled at 3.350 s of the voltammetric measurement, when the applied potential is −208.7 mV versus the reference electrode during the forward potential sweep of the 9th CV cycle. The index points of the voltammogram are independent variables which are necessary for a unique identification of variables and relevant for subsequent chemometric selection. The applied potential values are not unique identifiers of independent variables, as the same potential values are applied several times during a multicyclic voltammogram. For instance, the same potential value of −208.7 mV versus the reference electrode is also applied at 3.398 s of the voltammograms of Figure 1, corresponding to the 3398th index point of the voltammogram, during the reverse potential sweep of the 9th CV cycle. The relationship between applied potential values and index point of voltammograms/time is presented in the upper left corner of Figure 1. In order to present the entire voltammograms in Figure 1, scaling was applied which prevents a detailed quantitative visual analysis of the voltammograms. However, one can notice a cyclical pattern among these voltammograms. Each of the voltammograms consists of ten distinctive cycles. Figure 1 also shows the magnified portion of the voltammograms focusing on the ninth of the ten cycles.

The recorded current corresponds to the effect of suppressor on copper ion reduction (index points 3340–3400) and on copper metal oxidation (index points 3420–3460). Examination of the voltammograms shown in Figure 1 suggests that the current response in some portions of these voltammograms is linearly dependent on the concentration of suppressor, while being independent of concentration variations of other bath constituents (see the compositions of solutions C1, C8, C15, C19, and C21 in Table 1). Analogous conclusions can be drawn by visually analyzing the individually magnified last five cycles of the voltammograms of Figure 1.

3.2. Determination of the Optimum Ranges for a Calibration
3.2.1. SIMCA Modeling Power and LSR

Wikiel et al. [11] and Jaworski et al. [12] introduced a technique which proved to be helpful for selecting the range(s) of voltammograms to be taken for further decomposition and regression analysis. This method was employed [12] for selecting applicable ranges of various voltammograms to be subsequently analyzed by hierarchical and multiblock decomposition/regression techniques. This paper focuses on the applicable points of voltammograms of a single waveform (Figure 1). The method of determination of the optimum ranges of voltammograms presented in this paper is significantly improved and extended over that presented in [11–13] by introducing a two-step (whole voltammogram, then optimization within individual blocks) SIMCA-based selection and by confirming the ranges with several independent approaches to exclude artifacts.

The initial stage of determining the most promising ranges of the voltammogram to be taken for the calibration calculation utilizes two independent procedures applied to each index point of the voltammograms within the training set data: (i) correlation calculation based on univariate LSR and (ii) SIMCA-based calculation of modeling power [20].

The LSR-based method provides information about which range(s) of the voltammogram shows the greatest correlation with the concentration of the component to be calibrated. It also determines the range where the current responses depend only on changes in concentration of the component of interest. In this method, the regression equation for each index point of the training set voltammograms of Figure 1 is calculated by the least squares method. The regression equation obtained is employed to self-predict concentrations for the training set. The self-predicted concentrations are correlated with the corresponding actual concentrations within the training set. The squared correlation coefficients, $r_j^2$, are calculated for each $j$th point of the voltammogram.
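
The per-point LSR screening described above can be sketched as follows. This is a minimal illustration in Python with NumPy, not the authors' MATLAB routine. The function name `lsr_squared_correlation` and the choice to regress concentration on current are assumptions made for the example; in the univariate case the squared correlation coefficient is the same whichever variable is regressed on the other.

```python
import numpy as np

def lsr_squared_correlation(X, y):
    """Squared correlation between actual and self-predicted concentrations.

    X: (samples, index points) matrix of currents (hypothetical layout).
    y: (samples,) vector of analyte concentrations.
    Returns one squared correlation coefficient per index point.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    r2 = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        x = X[:, j]
        if x.max() == x.min():
            continue  # constant current: no regression possible, leave 0
        # Univariate least squares fit: concentration = slope * current + intercept
        slope, intercept = np.polyfit(x, y, 1)
        y_hat = slope * x + intercept  # self-predicted concentrations
        if np.std(y_hat) == 0.0:
            continue  # zero slope: prediction carries no information
        r = np.corrcoef(y, y_hat)[0, 1]
        r2[j] = r * r
    return r2
```

Applied to the whole voltammogram, the returned vector would play the role of the squared correlation curve that is thresholded in the preliminary range selection.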

The SIMCA-based method gives information about the noise-to-signal ratio for each point within the chosen range. This method [20] utilizes residuals calculated by Principal Component Analysis (PCA) to obtain the modeling power, $MP_j$, for each $j$th point of the voltammogram:

$$MP_j = 1 - \frac{s_j(\text{error})}{s_j(x)},$$

where $s_j(\text{error})$ is the square root of the residual variance for the number of factors $F$ for the $j$th point of the voltammogram:

$$s_j(\text{error}) = \sqrt{\frac{\sum_{i=1}^{I} e_{ij}^2}{I - F - 1}},$$

where $e_{ij}$ is the element of the matrix of residuals $\mathbf{E}$ and $I$ is the number of samples of the training set (in the example discussed throughout the paper, $I = 84$).

$s_j(x)$ is the square root of the meaningful variance for the $j$th point of the voltammogram:

$$s_j(x) = \sqrt{\frac{\sum_{i=1}^{I} x_{ij}^2}{I - 1}},$$

where $x_{ij}$ is the element of the training set matrix $\mathbf{X}$.

The modeling power, as implemented in the initial stage of the method for selection of optimum ranges, provides information about the noise-to-signal ratio for each point of the voltammogram. As the modeling power $MP_j$ approaches 1, the $j$th variable becomes highly relevant; conversely, as it approaches 0, the variable is of little utility to the model [20].
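
As a hedged illustration of the modeling power calculation, the sketch below uses the common SIMCA form: one minus the ratio of the residual standard deviation (with `I - F - 1` degrees of freedom) to the standard deviation of the mean-centered data (with `I - 1` degrees of freedom). Mean-centering, the degrees of freedom, and the name `modeling_power` are assumptions of this example and may differ from the implementation used for the paper.

```python
import numpy as np

def modeling_power(X, n_factors=1):
    """Modeling power of each column of X from a PCA model with n_factors."""
    X = np.asarray(X, dtype=float)
    n_samples = X.shape[0]
    Xc = X - X.mean(axis=0)  # column mean-centering (assumed preprocessing)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    F = n_factors
    X_hat = (U[:, :F] * s[:F]) @ Vt[:F]  # rank-F PCA reconstruction
    E = Xc - X_hat                       # matrix of residuals
    s_err = np.sqrt((E ** 2).sum(axis=0) / (n_samples - F - 1))
    s_x = np.sqrt((Xc ** 2).sum(axis=0) / (n_samples - 1))
    mp = np.zeros(X.shape[1])
    ok = s_x > 0                         # skip constant columns
    mp[ok] = 1.0 - s_err[ok] / s_x[ok]
    return mp
```

With these degrees of freedom the value can fall slightly below zero for a column that the factors do not describe at all; the sketch leaves such values unclipped.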

Figure 2 presents the parameters $r_j^2$ and $MP_j$ (squared correlation coefficient and modeling power, respectively), calculated for the entire voltammograms (all points of the voltammograms) of the waveform of Figure 1 recorded for the training set. It can be seen that the values of both parameters tend to change abruptly, with sharp oscillations between extremes related to useful and useless values for both modeling power and regression. The $MP_j$ parameters were calculated based on the PCA decomposition using the number of factors $F = 1$, as there is expected to be only one predominant source of variance in the selected ranges, namely, the changing concentration of suppressor. The preliminary selection of ranges of the voltammogram (Table 3) was conducted based on the criteria that the points of the voltammogram within each range should change monotonically and that every value of $r_j^2$ and $MP_j$ within the range must not fall below 0.95 and 0.80, respectively. Also, each selected range was continuous, including all points of the voltammogram within the limits of the range.
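
The threshold part of this preliminary selection can be sketched as a scan for contiguous runs of index points. The example assumes that the thresholds 0.95 and 0.80 apply to the squared correlation coefficient and the modeling power, respectively; the function name `select_ranges` is hypothetical, indices are zero-based, and the monotonicity criterion is not checked here.

```python
def select_ranges(r2, mp, r2_min=0.95, mp_min=0.80):
    """Return inclusive (start, stop) pairs of contiguous index points
    where both the squared correlation and the modeling power pass."""
    ranges = []
    start = None
    for j, (r, m) in enumerate(zip(r2, mp)):
        ok = r >= r2_min and m >= mp_min
        if ok and start is None:
            start = j                      # a new range opens here
        elif not ok and start is not None:
            ranges.append((start, j - 1))  # the range closed at the previous point
            start = None
    if start is not None:                  # range running to the last point
        ranges.append((start, len(r2) - 1))
    return ranges
```

For example, `select_ranges([0.5, 0.96, 0.97, 0.99], [0.9, 0.9, 0.85, 0.7])` keeps only the two middle points as one range.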

The subsequent step (the second step of the SIMCA modeling power procedure) is to determine the optimum ranges of voltammograms based solely on the modeling power technique. The voltammograms of the training set for each of the preliminarily selected ranges (Table 3) are then analyzed individually. In other words, separate modeling power analyses (each requiring another PCA decomposition) are applied individually to each of the ranges rather than to all of the points of the voltammogram (as was done to obtain the data of Figure 2). Figure 3 presents modeling power values calculated individually for each of the ranges of Table 3 as compared to modeling power values calculated for the whole voltammogram.

One can notice in Table 3 that the average modeling power ($MP_j$) values calculated within each range obtained by individual decomposition of each of the ranges (Figure 3) are systematically higher than the average $MP_j$ values corresponding to the same points of the voltammogram but obtained by a decomposition of the entire voltammogram (Figure 2). This has an obvious explanation, as the PCA decomposition of the entire voltammogram also needs to model the variances within the portions of the voltammograms which have no utility for the suppressor calibration (e.g., those showing very low values of the squared correlation coefficient $r_j^2$) (Figure 2). The lower values of average $MP_j$ obtained for the entire voltammogram decomposition are the result of the compromise made by decomposing unselected data (ranges) and therefore unnecessarily incorporating additional numerical noise. However, the $MP_j$ values for some variables (usually for those close to the beginning or end of the range) obtained via individual decomposition of the preselected ranges may be lower than the corresponding values obtained via decomposition of the entire voltammetric data (e.g., see Figure 3; the last several variables of the 13th range have $MP_j$ lower than 0.80). The individual selection of the ranges of the voltammogram is shown in Table 4. This secondary selection was achieved by individually analyzing the modeling power values for each of the ranges and by selecting points of the voltammograms corresponding to values of $MP_j$ which must not be lower than 0.85.

As the values of modeling power, calculated via PCA decomposition with a single factor, presented in Figure 3 approach one, it means that there is a single predominant, orthogonal variance within that data. This variance is caused by the varying concentration of suppressor, as the squared correlation coefficients calculated by LSR for these ranges of voltammograms are also approaching unity (Figure 2). This predominant variance is the focal point of the calculation of the analytical model aiming to predict the concentration of suppressor. Other, relatively small orthogonal variances are also present in the data. They may be worth exploring for the final tuning of the analytical model, but it should be determined by cross- and external validations whether they would be worth incorporating into the analytical model optimized for suppressor-caused variance only. The trade-off may be an unnecessary incorporation of numerical noise which lowers the predictive ability of the model. In order to avoid such a scenario, the final tuning of the model needs to be conducted based on external validation.

3.2.2. Uninformative Variable Elimination

The uninformative variables (index points of the voltammogram) increase the bias and imprecision of the latent variables [21, 22]. While introducing UVE-PLS, Centner et al. [21] pointed out “in the original PLS method all variables are used; PLS is a so-called full-spectrum method. However, one can wonder whether it is useful to include all variables, because some of them may be noisy and/or contain nonrelevant information.” Elimination of uninformative variables leads to more parsimonious and more robust models and to better prediction [21, 22]. In the UVE method, random (and therefore uninformative) variables are appended to the original data matrix, resulting in an augmented matrix that is used to calculate a reliability criterion for each original and each added random variable and to retain only the experimental variables for which the value of the reliability criterion is larger than the values obtained for the random variables. The appended artificial random variables are multiplied by a small constant ($1 \times 10^{-10}$) in order to have a negligible influence on the model while retaining the same random variation within the artificial variables. The goal of UVE-PLS and UVE-PCR is not variable selection in the sense that one tries to find the best subset of variables for fitting or prediction of a model but elimination of those variables that are useless. Therefore, UVE-PLS (UVE-PCR) is not an equivalent but a supplementary technique to the coupled LSR and SIMCA modeling power.

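Because the latent variables are linear combinations of the original ones, the PLS and PCR models can be expressed as

$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e},$$

where $\mathbf{b}$ is the regression vector, described for PCR by the following equation:

$$\mathbf{b} = \mathbf{P}\mathbf{q},$$

where $\mathbf{P}$ is the matrix of PCA loadings and $\mathbf{q}$ is the vector of Inverse Least Squares (ILS) regression coefficients for PCR:

$$\mathbf{q} = (\mathbf{T}^{\mathrm{T}}\mathbf{T})^{-1}\mathbf{T}^{\mathrm{T}}\mathbf{y},$$

with $\mathbf{T}$ being the matrix of PCA scores. For each $j$th variable, the regression reliability criterion $c_j$ for UVE-PCR [23] and UVE-PLS [21, 22] is defined as

$$c_j = \frac{\bar{b}_j}{s(b_j)},$$

where $\bar{b}_j$ is the mean and $s(b_j)$ is the standard deviation of the vector of coefficients $b_j$ obtained by (leave-one-out) jackknifing.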

UVE-PLS and UVE-PCR (number of latent variables for decomposition, $F = 1$) were applied to all points of the voltammograms of Figure 1 augmented with the variables corresponding to random noise data. The results obtained for regression reliability are shown in Figure 4. The performance of UVE-PLS and UVE-PCR is equivalent.

For each variable $j$, the greater the absolute value of the regression reliability $c_j$, the more informative the $j$th variable is for the model. By visual comparison of the data in Figures 2 and 4, it can be seen that the elevated absolute values of regression reliability overlap with the modeling power $MP_j$ and the squared correlation coefficient $r_j^2$ approaching unity. The cutoff criterion is determined by the maximum absolute value of regression reliability calculated for the random variables (right half of Figure 4), which is 32.6 and 31.7 for UVE-PLS and UVE-PCR, respectively.
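
The UVE-PCR procedure and its cutoff can be illustrated with the short sketch below. It is an assumption-laden example rather than the routine used for the paper: the data are mean-centered inside each jackknife fold, the appended random variables are standard normal values scaled by `1e-10`, as many random variables are appended as there are experimental ones, and the names `pcr_coefficients` and `uve_pcr` are hypothetical.

```python
import numpy as np

def pcr_coefficients(X, y, n_factors=1):
    """Regression vector b = P q of a PCR model on mean-centered data."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_factors] * s[:n_factors]    # PCA scores
    P = Vt[:n_factors].T                    # PCA loadings
    q = np.linalg.solve(T.T @ T, T.T @ yc)  # ILS coefficients on the scores
    return P @ q

def uve_pcr(X, y, n_factors=1, noise_scale=1e-10, seed=0):
    """Reliability c_j = mean(b_j) / std(b_j) from leave-one-out jackknifing."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, J = X.shape
    rng = np.random.default_rng(seed)
    # Append J artificial random variables with negligible magnitude
    XR = np.hstack([X, noise_scale * rng.standard_normal((n, J))])
    B = np.array([
        pcr_coefficients(np.delete(XR, i, axis=0), np.delete(y, i), n_factors)
        for i in range(n)
    ])
    c = B.mean(axis=0) / B.std(axis=0, ddof=1)
    cutoff = np.abs(c[J:]).max()            # largest |c| among random variables
    keep = np.abs(c[:J]) > cutoff           # experimental variables retained
    return c, cutoff, keep
```

The first `J` entries of `c` belong to the experimental variables and the remaining `J` to the random ones, which mirrors the left and right halves of a reliability plot.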

Table 5 lists in columns three and four the minimum absolute values of regression reliability calculated for the individually selected ranges of variables (Table 4), for UVE-PLS and UVE-PCR, respectively. The same minimum value of 222 was obtained for the range $k = 13$ with both methods, UVE-PLS and UVE-PCR. As this value lies significantly above the cutoff level, both UVE-PLS and UVE-PCR confirm the correctness of choosing the ranges of Table 4 for building an analytical model.

3.2.3. Variable Importance in the Projection

PLS-VIP is a combined measure of how much a variable contributes to describing the two sets of data: the dependent and the independent variables [24, 25], which in the studied case correspond to the concentration of suppressor and the voltammetric current values, respectively. For the number of factors $F = 1$, the expression describing the VIP value for the $j$th variable is given by

$$\mathrm{VIP}_j = \sqrt{J \frac{w_j^2}{\lVert \mathbf{w} \rVert^2}},$$

where $J$ is the total number of variables, $w_j$ is the PLS weight value for variable $j$, and $\lVert \mathbf{w} \rVert$ is the Frobenius norm of the weight vector defined by the following expression:

$$\lVert \mathbf{w} \rVert = \sqrt{\sum_{j=1}^{J} w_j^2}.$$

The weights in a PLS model reflect the covariance between the independent and dependent variables and the inclusion of the weights is what allows VIP to reflect not only how well the dependent variable is described but also how important that information is for the model of independent variables. Since the average of squared VIP scores equals unity, values smaller than one indicate nonimportant variables.
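
For a single latent variable the VIP calculation reduces to a few lines, sketched below under stated assumptions: both blocks are autoscaled, the first PLS weight vector for one dependent variable is taken as proportional to the product of the transposed independent block and the dependent vector, and the function name `vip_one_factor` is hypothetical. Constant columns are not handled.

```python
import numpy as np

def vip_one_factor(X, y):
    """VIP scores of a one-factor PLS model on autoscaled X and y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # autoscaled currents
    ys = (y - y.mean()) / y.std(ddof=1)                # autoscaled concentrations
    w = Xs.T @ ys           # first PLS weight vector, up to normalization
    J = X.shape[1]
    return np.sqrt(J) * np.abs(w) / np.linalg.norm(w)
```

Because the squared scores are `J` times the squared normalized weights, their average equals one by construction, which is the basis of the cutoff level of one.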

PLS-VIP (number of latent variables for decomposition, $F = 1$) was applied to all points of the autoscaled voltammograms of Figure 1 and the autoscaled dependent variables (concentrations). The results obtained for VIP are shown in Figure 5 and Table 5. All index points of the voltammogram (variables $j$) preselected with modeling power and LSR (Table 4) are also of the highest importance in the projection based on the PLS-VIP analysis, with a minimum VIP value of 1.47 obtained for numerous individually selected ranges (Table 5). As this value lies significantly above the cutoff level of one, the application of PLS-VIP also confirms the correctness of choosing the ranges of Table 4 for building an analytical model.

3.2.4. Interval Partial Least Squares and Interval Principal Component Regressions

In IPLS [26] and IPCR the voltammograms are divided into a number of nonoverlapping, consecutive intervals, and PLS and PCR models are developed for each of these intervals. The aim is to find intervals which give better predictions than those obtained when using the full voltammogram. The comparison of interval performance is mainly based on the Root Mean Square Error of Cross-Validation (RMSECV). The RMSECV values obtained for the training set voltammograms with IPLS and IPCR for the arbitrarily chosen interval width of 31 points of the voltammograms (variables $j$) are presented in Figures 6 and 7, respectively. The points of the voltammograms were divided into 125 nonoverlapping consecutive intervals and an additional last interval with a width of 36 to accommodate all $J = 3911$ index points of the voltammogram (125 × 31 + 36 = 3911).
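
The interval layout described above can be reproduced with a small helper. This is only a sketch of the partitioning step, with a hypothetical function name and zero-based, end-exclusive bounds; the per-interval PLS or PCR modeling is not included.

```python
def interval_bounds(n_points, width):
    """Split n_points into consecutive intervals of the given width.

    Returns (start, stop) pairs with stop exclusive. Leftover points that
    do not fill a whole interval are merged into the last interval.
    """
    n_intervals = n_points // width
    if n_intervals == 0:
        return [(0, n_points)]          # fewer points than one interval
    bounds = [(i * width, (i + 1) * width) for i in range(n_intervals)]
    if n_points % width:
        last_start = bounds.pop()[0]    # widen the last interval
        bounds.append((last_start, n_points))
    return bounds
```

Calling `interval_bounds(3911, 31)` yields 126 intervals in total: 125 of width 31 followed by a last one of width 36.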

It can be seen in Table 5 that the ranges of points selected for calibration for the voltammograms in Table 4 correspond to the lowest values of RMSECV in Figures 6 and 7. However, there are two problems with using IPLS and IPCR for variable selection, related to the specificity of the voltammetric data of Figure 1. First, some of the selected ranges in Table 4 (especially those for oxidation processes) are very narrow ($k = 14$, $J_{14} = 13$, 0.33% of the total number of variables $J$). The second problem is that these narrow ranges neighbor points of the voltammograms that show very poor correlation and have no utility to the model (Figures 2, 4, and 5). These two facts may lead to a situation where there is insufficient representation of “proper” variables in a selected interval to obtain an RMSECV value lower than that obtained for cross-validation of the entire voltammogram.

In order to address the issues outlined in the paragraph above, MWPLS was implemented. This technique allows a sufficient representation of “proper” variables neighboring, on both sides, the variable for which the RMSECV is calculated. The results obtained with MWPLS are presented in Figure 8. One can notice in Table 5 that even the maximum values of RMSECV obtained with MWPLS for the individually selected ranges of Table 4 correspond to the lowest values of the graph of Figure 8. All three methods, IPLS, IPCR, and MWPLS, confirm the selection of the ranges in Table 4.
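
The moving window idea can be sketched as follows. This example is not the iToolbox routine: it assumes a one-factor PLS model on mean-centered data, leave-one-out cross-validation, and a symmetric window that is truncated at the ends of the voltammogram. All function names are hypothetical.

```python
import numpy as np

def pls1_one_factor_predict(X_train, y_train, X_test):
    """Predict with a one-factor PLS model built on mean-centered data."""
    x_mean = X_train.mean(axis=0)
    y_mean = y_train.mean()
    Xc = X_train - x_mean
    yc = y_train - y_mean
    w = Xc.T @ yc                       # weight vector before normalization
    norm = np.linalg.norm(w)
    if norm == 0.0:
        return np.full(X_test.shape[0], y_mean)
    w = w / norm
    t = Xc @ w                          # scores of the training samples
    q = (t @ yc) / (t @ t)              # inner regression coefficient
    return y_mean + ((X_test - x_mean) @ w) * q

def rmsecv_loo(X, y):
    """Leave-one-out root mean square error of cross-validation."""
    errors = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        pred = pls1_one_factor_predict(X[mask], y[mask], X[i:i + 1])[0]
        errors.append(pred - y[i])
    return float(np.sqrt(np.mean(np.square(errors))))

def moving_window_rmsecv(X, y, half_width):
    """RMSECV for a window centered on each variable in turn."""
    J = X.shape[1]
    out = np.empty(J)
    for j in range(J):
        lo = max(0, j - half_width)         # window truncated at the left edge
        hi = min(J, j + half_width + 1)     # and at the right edge
        out[j] = rmsecv_loo(X[:, lo:hi], y)
    return out
```

Unlike fixed intervals, each variable is evaluated together with its neighbors on both sides, so a narrow informative range is not diluted by an arbitrary interval boundary.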

There must be an agreement about the outcome of the selection of variables for all methods presented (LSR, two-step SIMCA modeling power, UVE-PLS (UVE-PCR), PLS-VIP, IPLS, IPCR, and MWPLS) in order to accept these variables for further model development. As Andersen and Bro [31] pointed out, it is always instructive to compare the results from several types of variable selection to assess whether they complement each other.

3.3. Selection of the Most Suitable Data Decomposition Method

Multivariate data decomposition techniques are employed in chemometrics in order to compress vast amounts of data while extracting the significant information. The techniques commonly used can be divided into three major groups: (i) two-way techniques: PCA and PLS; (ii) hierarchical and multiblock techniques: HPCA [19], HPLS [19], CPCA [19], and MBPLS [19]; and (iii) multiway techniques: Generalized Rank Annihilation Method (GRAM) [32], PARAllel FACtor analysis (PARAFAC) [32], Multilinear PLS (N-PLS) [32], and Tucker models [32]. There is also a hybrid method called Multiway PCA (MPCA) [33], described as a “poor man’s” multiway technique [34] as it relies upon a two-way PCA and rearrangement by unfolding the original multiway data.

The structure of the voltammetric data of the training set to be decomposed is described in Table 4. It consists of fourteen ($K = 14$) blocks, each of them containing $I = 84$ samples. The number of variables of each block, $J_k$, varies from 13 to 52 depending on the block number $k$. As the data consists of several blocks, the two-way methods, PCA and PLS, are not applicable. By extensive selection, this data could be arranged into a three-way array of the dimensions $I \times J_{14} \times K$, where $J_{14}$ would correspond to the number of variables of the narrowest ($k = 14$) range, $J_{14} = 13$. For instance, in order to fit such a three-way array, every fourth variable would need to be extracted from the widest range of $J_k = 52$ points. Such a selection would mean a loss of 75% of the data for the widest range already at the preliminary stage of the data arrangement (prior to the actual analysis).

Qualitatively, the data of Table 4 can be divided into two distinct groups, each consisting of seven ranges. The division is based on the two electrode processes investigated: reduction and oxidation. Although the blocks differ quantitatively (Figure 1), the qualitative difference among the blocks investigating the same process (for instance, oxidation) may not be pronounced enough to treat them as different slabs, that is, layers (submatrices) of a three-way array. Applying multiway techniques with a higher number of factors to insufficiently dissimilar slabs may result in some factors being highly correlated.

One may consider applying MPCA to decompose the multiblock training data of Table 4 after unfolding it into a two-way matrix with 84 rows (samples) and as many columns as there are variables in all blocks combined. However, even after columnwise autoscaling, the blocks will not be equally (or controllably) represented. The data of the wider blocks will have a greater representation than those of the narrower ones.

The hierarchical and multiblock data decomposition techniques [12, 19] calculate the block scores and block loadings for each of the K blocks using regular two-way decomposition techniques. Subsequently, all block scores are combined into the super scores prior to the regression. These techniques are the most suitable for treating the voltammetric data defined in Table 4. Applying them to the voltammetric data for suppressor avoids the compromises in the accuracy and soundness of the data representation that may otherwise be needed when arranging the data in three-way arrays. The scaling employed in this paper provides equal representation of each of the K = 14 blocks, regardless of the width of the block. Before applying CPCA, HPCA, HPLS, and MBPLS, the data is block-scaled by dividing by the square root of the number of variables in the block in order to equalize the representation of the various blocks [19]:

$$\mathbf{X}_k^{\mathrm{scaled}} = \frac{\mathbf{X}_k}{\sqrt{J_k}}, \qquad k = 1, \ldots, K,$$

where $\mathbf{X}_k$ is the data matrix of block $k$ and $J_k$ is the number of variables in that block.
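The effect of this block scaling can be illustrated with a minimal sketch on synthetic data. The random blocks, the helper names `autoscale` and `block_scale`, and the assumption that columnwise autoscaling precedes the block scaling are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 84            # training samples per block (from the text)
block_widths = [13, 52]   # narrowest and widest block widths (from the text)

def autoscale(X):
    """Columnwise mean-centering and scaling to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def block_scale(X):
    """Autoscale, then divide by sqrt(number of variables in the block)."""
    return autoscale(X) / np.sqrt(X.shape[1])

# Synthetic stand-ins for two voltammetric blocks (not real data).
blocks = [rng.normal(size=(n_samples, w)) for w in block_widths]

# Total sum of squares each block would contribute to a combined model.
ss_auto = [float(np.sum(autoscale(X) ** 2)) for X in blocks]
ss_block = [float(np.sum(block_scale(X) ** 2)) for X in blocks]

print("autoscaled only:", ss_auto)
print("block-scaled:   ", ss_block)
```

With autoscaling alone, the 52-variable block carries four times the sum of squares of the 13-variable block. After division by the square root of the block width, both blocks contribute the same amount.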

Westerhuis et al. [19] derived an algebraic proof of the equivalence of CPCA and MBPLS to the standard PCA and PLS, respectively, when the same variable scaling is applied. They recommended utilizing the standard PCA and PLS methods, which require less computation and give a better estimation of the scores when there are missing data, because the correlations among all variables can be used for estimating the scores instead of only the correlations among the variables in the specific block.

3.4. Validation of Analytical Models

Calibration calculations were conducted comparatively using the CPCR, HPCR, HPLS, and MBPLS methods for the training set data (84 samples). The predictive ability of the calibration models was verified on the external validation set containing voltammograms recorded for 27 different samples. Prior to being projected on the calibration model, the validation set was scaled using the scaling parameters of the training set. The optimum number of factors was determined for each of the methods based on the analysis of Predictive Residual Sum of Squares (PRESS) values calculated for the residual concentrations of the external validation set employing the following expression:

$$\mathrm{PRESS}_a = \sum_{i=1}^{N_{\mathrm{val}}} \left(c_i - \hat{c}_{i,a}\right)^2,$$

where $c_i$ describes the original (not centered) actual concentration of the $i$th sample in the validation set, $\hat{c}_{i,a}$ denotes the rescaled predicted concentration of the $i$th sample of the validation set using the $a$-factor decomposition, and $N_{\mathrm{val}} = 27$ is the number of validation samples. The results obtained are presented in Table 6.
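As a minimal sketch, PRESS for an external validation set can be computed as the sum of squared differences between actual and predicted concentrations. The function name `press` and the three-sample concentrations below are made up for illustration and are not values from Table 6.

```python
import numpy as np

def press(c_actual, c_predicted):
    """Predictive Residual Sum of Squares for one model."""
    c_actual = np.asarray(c_actual, dtype=float)
    c_predicted = np.asarray(c_predicted, dtype=float)
    return float(np.sum((c_actual - c_predicted) ** 2))

# Hypothetical validation concentrations and predictions for models
# built with 1 and 2 factors (illustrative numbers only).
c_actual = np.array([6.0, 7.0, 8.0])
c_pred_by_factors = {
    1: [6.3, 6.8, 8.4],
    2: [6.1, 7.1, 7.9],
}

press_values = {a: press(c_actual, p) for a, p in c_pred_by_factors.items()}
print(press_values)
```

In this toy case the two-factor model has the smaller PRESS, which is the kind of comparison made across the rows of Table 6.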

Additionally, the predictive performance of the CPCR, HPCR, HPLS, and MBPLS analytical models was comparatively assessed by determining the values of two derived parameters: the mean no-sign relative-to-target error of prediction and the mean no-sign relative error of prediction (Table 6). The mean no-sign relative-to-target error of prediction (MNSRTEP) is the mean of the absolute (no-sign) prediction errors of the validation samples, expressed relative to the target-level concentration. In the case of the specific bath composition in this investigation the target concentration of accelerator was 7.00 mL L−1.

The mean no-sign relative error of prediction (MNSREP) is the mean of the absolute prediction errors of the validation samples, each expressed relative to the actual concentration of the respective sample.
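The two derived parameters can be sketched as follows, reading "no-sign" as the absolute value of the prediction error. The function names, the sample values, and the decision to return fractions (multiply by 100 if percentages are wanted) are assumptions made for this illustration; only the 7.00 mL L−1 target comes from the text.

```python
import numpy as np

def mnsrtep(c_actual, c_predicted, c_target):
    """Mean absolute prediction error relative to the target concentration."""
    err = np.abs(np.asarray(c_predicted, float) - np.asarray(c_actual, float))
    return float(np.mean(err) / c_target)

def mnsrep(c_actual, c_predicted):
    """Mean absolute prediction error relative to each actual concentration."""
    c_actual = np.asarray(c_actual, float)
    err = np.abs(np.asarray(c_predicted, float) - c_actual)
    return float(np.mean(err / c_actual))

c_target = 7.00                     # target concentration, mL/L (from the text)
c_actual = [6.0, 7.0, 8.0]          # hypothetical validation concentrations
c_predicted = [6.3, 6.8, 8.8]       # hypothetical predictions

rtep = mnsrtep(c_actual, c_predicted, c_target)
rep = mnsrep(c_actual, c_predicted)
print(f"MNSRTEP = {rtep:.4f}, MNSREP = {rep:.4f}")
```

The two measures differ only in the denominator: MNSRTEP weights every sample's error against the same target level, whereas MNSREP penalizes a given absolute error more heavily at low actual concentrations.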

Examining the results in Table 6, one can see from the PRESS, MNSRTEP, and MNSREP values that the vast majority of the total variance has already been captured by the first factor. The correlation parameter, PRESS, MNSRTEP, and MNSREP have practically the same values with one factor for three of the techniques: CPCR, HPCR, and MBPLS. Although the values obtained for HPLS differ from those for the other methods, the difference is not significant. The introduction of a second factor/latent variable slightly improves the values of PRESS, MNSRTEP, and MNSREP, along with the correlation. The incorporation of a third factor/latent variable into the decomposition worsens the values of all of these parameters for MBPLS and HPLS. Usually, the optimum number of factors/latent variables should be determined based on the first local minimum of the error criterion. In the specific case discussed, the improvement of the (already good) values of PRESS, MNSRTEP, and MNSREP by the addition of the second factor justifies using two factors for final tuning of the analytical model. Therefore, our conclusion is to employ two factors/latent variables for building the analytical model, regardless of the chemometric technique chosen for data decomposition. The selected optimum parameters are presented in bold in Table 6.
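The first-local-minimum rule mentioned above can be written as a small helper. This is a sketch under the assumption that the criterion is an error measure such as PRESS listed in order of increasing factor count; the function name and the numbers are hypothetical and do not reproduce Table 6.

```python
def first_local_minimum(values):
    """Return the 1-based number of factors at the first local minimum.

    `values[a - 1]` is the error criterion obtained with `a` factors.
    """
    for a in range(len(values) - 1):
        if values[a + 1] >= values[a]:
            return a + 1          # adding one more factor no longer helps
    return len(values)            # criterion decreased all the way

# Illustrative error values for 1, 2, 3, and 4 factors.
press_by_factor = [0.30, 0.25, 0.40, 0.20]
n_factors = first_local_minimum(press_by_factor)
print(n_factors)
```

In this made-up sequence the error drops from one to two factors and rises at three, so the rule stops at two factors even though a later, deeper minimum exists. Stopping early in this way favors a parsimonious model.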

The performance of all four techniques is equivalent, with the exception of the HPLS and MBPLS results for three factors, which are worse than those of the other methods. On the other hand, the performances of HPLS and MBPLS with three factors in the take-one-out cross-validatory calculation within the training set (Table 7) are superior (especially for MBPLS) to those of the other methods for the same number of factors/latent variables used for decomposition. Usually, a combination of better cross-validatory (or self-prediction) performance with worse predictive performance on external data is evidence of overfitting of the model. The PLS-based models tend to be more prone to overfitting than the PCA-based ones. Again, by keeping the number of factors low, one can significantly reduce the probability of such an occurrence (see the practically equivalent performance of all techniques with one and with two factors in Table 6).

Figure 9 presents a comparison of the actual concentrations of suppressor with those predicted by CPCR, MBPLS, HPCR, and HPLS using two factors for the external validation set of 27 samples, demonstrating equivalent performance of all techniques and providing evidence for the absence of artifacts in the selected voltammetric data.

4. Conclusions

This paper has introduced to the field of electroanalysis several variable selection techniques (UVE-PLS, UVE-PCR, PLS-VIP, IPLS, IPCR, and MWPLS) that are well established for the analysis of spectroscopic data. Specifically, it introduced a rigorous, multistep procedure, focused on individual blocks, for selecting the blocks of the voltammogram to be used subsequently for analytical model development, based on LSR, two-step SIMCA modeling power, UVE-PLS, UVE-PCR, PLS-VIP, MWPLS, IPLS, and IPCR. To ensure feasibility of the model, several variable selection methods were utilized comparatively to verify that their results do not contradict each other. Detailed criteria were presented for choosing the decomposition technique appropriate for the data in order to compress the multivariate data and extract the relevant information. The optimization of the number of factors based on external validation and cross-validation was also presented. As a general recommendation, a few variable selection methods should be applied concurrently to the investigated multivariate data set, as their mutually consistent performance provides evidence for the absence of artifacts in that data. All the methodology for data selection can be automated and can ultimately lead to a fully automatic data selection process that would significantly reduce the time and effort of method development.

Electroanalytical chemists would benefit substantially from variable selection methods, which allow them to focus only on the relevant portions of voltammetric responses while eliminating the uninformative ones. Otherwise, the incorporation of irrelevant variables (both random and systematic) into a multivariate model leads to lower precision (higher variance due to embedded error). By analogy to other disciplines of analytical chemistry, variable selection using existing chemometric techniques should become a routine step of multivariate data pretreatment in electroanalytical chemistry. This paper demonstrates that existing variable selection methods can be transferred to electroanalytical data.

Abbreviations

ASV: Anodic Stripping Voltammetry
CPCR: Consensus Principal Component Regression
CV: Cyclic Voltammetry
GRAM: Generalized Rank Annihilation Method
HPCR: Hierarchical Principal Component Regression
HPLS: Hierarchical Partial Least Squares
IPCR: Interval Principal Component Regression
IPLS: Interval Partial Least Squares
LSR: Least Squares Regression
MBPLS: Multiblock Partial Least Squares
MNSREP: Mean No-Sign Relative Error of Prediction
MNSRTEP: Mean No-Sign Relative-to-Target Error of Prediction
MPCA: Multiway Principal Component Analysis
MTEP: Multi-Task Electrochemical Probe
MWPLS: Moving Window Partial Least Squares
N-PLS: Multilinear Partial Least Squares
PARAFAC: Parallel Factor Analysis
PCA: Principal Component Analysis
PCR: Principal Component Regression
PLS: Partial Least Squares
PLS-VIP: Variable Importance in the Projection coupled with Partial Least Squares
PRESS: Predictive Residual Sum of Squares
RMSECV: Root Mean Square Error of Cross-Validation
SIMCA: Simple Modeling of Class Analogy
UVE-PCR: Uninformative Variable Elimination by Principal Component Regression
UVE-PLS: Uninformative Variable Elimination by Partial Least Squares

Conflicts of Interest

Dr. Aleksander Jaworski, Dr. Hanna Wikiel, and Dr. Kazimierz Wikiel declare that there are no conflicts of interest regarding the publication of the paper.

Acknowledgments

The authors would like to thank Dr. Allan Reed for his comments on the manuscript.