Abstract

Many information processing problems can be transformed into some form of eigenvalue or singular value problems. Eigenvalue decomposition (EVD) and singular value decomposition (SVD) are usually used for solving these problems. In this paper, we give an introduction to various neural network implementations and algorithms for principal component analysis (PCA) and its various extensions. PCA is a statistical method that is directly related to EVD and SVD. Minor component analysis (MCA) is a variant of PCA, which is useful for solving total least squares (TLSs) problems. The algorithms are typical unsupervised learning methods. Some other neural network models for feature extraction, such as localized methods, complex-domain methods, generalized EVD, and SVD, are also described. Topics associated with PCA, such as independent component analysis (ICA) and linear discriminant analysis (LDA), are mentioned in passing in the conclusion. These methods are useful in adaptive signal processing, blind signal separation (BSS), pattern recognition, and information compression.

1. Introduction

In information processing such as pattern recognition, data compression and coding, image processing, high-resolution spectrum analysis, and adaptive beamforming, feature extraction or feature selection is necessary to cope with the large volume of raw data. Feature extraction is a dimensionality-reduction technique, mapping high-dimensional patterns onto a lower-dimensional space by extracting the most prominent features using orthogonal transforms. The extracted features do not have any physical meaning. In contrast, feature selection decreases the size of the feature set or reduces the dimension of the features by discarding some of the raw information according to a criterion.

Orthogonal decomposition is a well-known technique to eliminate ill-conditioning. The Gram-Schmidt orthonormalization (GSO) is suitable for feature selection. This is because the physically meaningless features in Gram-Schmidt space can be linked back to the same number of variables of the measurement space, thus resulting in no dimensionality reduction. The GSO procedure starts with QR decomposition of the transpose of the full feature matrix, 𝐗𝑇, where 𝐗=[𝐱1,𝐱2,…,𝐱𝑁]. QR decomposition can be performed by using the Householder transform or the Givens rotation [1], which is suitable for hardware implementation. The GSO transform can be used for feature subset selection; it inherits the compactness of the orthogonal representation and at the same time provides features that retain their original meaning.

An orthogonal transform can decompose the correlations among the candidate features so that the significance of the individual features can be evaluated independently. Principal component analysis (PCA) is a well-known orthogonal transform that is used for dimensionality reduction. Another popular technique for feature extraction is linear discriminant analysis (LDA), also known as Fisher’s discriminant analysis [2, 3]. Taking all the data into account, PCA computes vectors that have the largest variance associated with them. The generated PCA features do not have clear physical meanings. In contrast, LDA searches for those vectors in the underlying space that best discriminate among the classes rather than those that best describe the data.

For a 𝐽1-dimensional data set {𝐱𝑖} of size 𝑁, PCA [4] generates a 𝐽2-dimensional feature set {𝐲𝑖} of the same size, 𝐽1>𝐽2, by using the linear transformation 𝐲𝑖=𝐖𝑇𝐱𝑖. The weight matrix 𝐖 can be determined under different criteria, such as output variance maximization or MSE minimization. In comparison with the GSO transform, PCA generates each of its features based on the covariance matrix of all the 𝑁 vectors 𝐱𝑖, 𝑖=1,…,𝑁. Dimensionality reduction is achieved by dropping the variables with insignificant variances. PCA is often used to select inputs, but it is not always useful, since for non-Gaussian signals the variance of a signal is not necessarily related to its importance. An improvement on PCA is provided by nonlinear generalizations of PCA, which extend the ability of PCA to incorporate nonlinear relationships in the data. Two-dimensional PCA [5] is designed for image feature extraction.

In this paper, we give a state-of-the-art introduction to various neural network implementations and algorithms for PCA and its extensions. This paper is organized as follows. In Section 2, we introduce the Hebbian learning rule and Oja’s learning rule. Section 3 defines the PCA problem. Various PCA networks and algorithms are treated in Sections 4–7. PCA algorithms based on the Hebbian rule are expanded in Section 4. In Section 5, least mean squared error-based PCA methods are dealt with. Other optimization-based PCA methods are described in Section 6. PCA based on the anti-Hebbian rule is treated in Section 7. Nonlinear PCA is addressed in Section 8. Section 9 is dedicated to minor component analysis (MCA). In Section 10, we describe various localized PCA strategies. Section 11 extends all the methods to the complex-valued domain. Some other generalizations of PCA such as constrained PCA, generalized EVD, and two-dimensional PCA are described in Section 12. In Section 13, the cross-correlational PCA asymmetric network and SVD are described. Canonical correlation analysis is described in Section 14. A simulation example of PCA is given in Section 15. A brief summary is given in Section 16, and independent component analysis (ICA) and linear discriminant analysis (LDA) are also mentioned in passing in this section.

2. Hebbian Learning Rule and Oja’s Learning Rule

The stochastic approximation theory [6], introduced by Robbins and Monro in 1951, is an important tool for analyzing stochastic discrete-time systems including the classical gradient-descent method. Consider a stochastic discrete-time system of the form 𝐳(𝑡+1)=𝐳(𝑡)+𝜂(𝑡)(𝐟(𝐳,𝑡)+𝐧(𝑡)), (1) where 𝐳 is the state vector, 𝐟(𝐳,𝑡) is a finite nonzero vector whose entries are functions, and 𝐧(𝑡) is an unbiased noise term. Assuming that {𝜂(𝑡)} is a sequence of positive numbers satisfying the Robbins-Monro conditions [6] ∑_{𝑡=1}^{∞}𝜂(𝑡)=∞, ∑_{𝑡=1}^{∞}𝜂²(𝑡)<∞, (2) the stochastic system (1) can be transformed into the deterministic differential equation d𝐳/d𝑡=𝐟(𝐳,𝑡). (3) If (3) converges to a fixed point 𝐳*, then (1) also converges to 𝐳* as 𝑡→∞ with probability one. The Robbins-Monro conditions [6] require 𝜂(𝑡)→0 as 𝑡→∞. Typically, one can select 𝜂(𝑡)=1/(𝛼+𝑡), with 𝛼≥0 a constant, or 𝜂(𝑡)=1/𝑡^𝛽, 1/2<𝛽≤1 [7, 8].

The Hebbian learning rule was introduced in [9]. For a single neuron, the Hebbian rule is written as 𝐰(𝑡+1)=𝐰(𝑡)+𝜂𝑦(𝑡)𝐱𝑡,(4) where the learning rate 𝜂>0, 𝐰 is the weight vector from the input to the neuron, 𝐱𝑡 is an input vector presented at time 𝑡, and the output of the neuron 𝑦(𝑡) is defined by 𝑦(𝑡)=𝐰𝑇(𝑡)𝐱𝑡.(5)

For a stochastic input vector 𝐱, applying an analysis on (4) using the stochastic approximation theory, we get the only equilibrium state 𝐰=𝟎 by maximizing the criterion 𝐸Hebb=𝐸[𝑦2], where 𝐸[] is the expectation operator [10, 11]. The solution 𝐰=𝟎 is unstable, which drives 𝐰 to infinite magnitude, with a direction parallel to that of the eigenvector of 𝐂=𝐸[𝐱𝐱𝑇] corresponding to the largest eigenvalue [11]. To prevent the divergence of the Hebbian rule, 𝐖 can be normalized after each iteration [12, 13], and this leads to the normalized Hebbian rule. Other methods, such as Oja’s rule [14], Yuille’s rule [15], Linsker’s rule [16], and Hassoun’s rule [11], add a weight-decay term to the Hebbian rule to stabilize the algorithm.

Oja’s rule introduces a weight decay term into the Hebbian rule [14] to prevent instability: 𝐰(𝑡+1)=𝐰(𝑡)+𝜂𝑦(𝑡)𝐱𝑡−𝜂𝑦²(𝑡)𝐰(𝑡). (6) Oja’s rule converges to a state that maximizes 𝐸Hebb subject to ‖𝐰‖=1. The solution is the principal eigenvector of 𝐂 [14]. For small 𝜂, Oja’s rule is proved to be equivalent to the normalized Hebbian rule [14]. Based on stochastic approximation theory, the stable solutions exist only at 𝐰=±𝐜1, provided that 𝜆1>𝜆2, where 𝐜1 is the eigenvector corresponding to 𝜆1, and 𝜆1 and 𝜆2 are the two largest eigenvalues of 𝐂 [10, 11]. Thus, Oja’s rule always converges to the principal component of 𝐂.
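
To make the single-neuron case concrete, the following minimal NumPy sketch applies the update (6) to zero-mean data; the function name, learning rate, and epoch count are illustrative choices, not values prescribed by the original algorithm.

```python
import numpy as np

def oja_first_pc(X, eta=0.01, epochs=50, seed=0):
    """Estimate the principal eigenvector of C = E[x x^T] with Oja's rule (6).

    X is an (N, d) array of approximately zero-mean samples; the returned
    weight vector approximates the principal eigenvector up to sign.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)                  # start from a unit vector
    for _ in range(epochs):
        for x in X:
            y = w @ x                       # neuron output, eq. (5)
            w += eta * y * (x - y * w)      # Hebbian term with weight decay, eq. (6)
    return w

# Usage: compare with the principal eigenvector of the sample covariance matrix.
rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 3)) @ rng.standard_normal((3, 3))   # correlated data
w = oja_first_pc(X)
c1 = np.linalg.eigh(X.T @ X / len(X))[1][:, -1]
print(abs(w @ c1))                          # close to 1, i.e., w is aligned with c1
```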

The Robbins-Monro conditions are not practical for implementation, especially for learning nonstationary data. Zufiria [17] has proposed to convert the stochastic discrete-time algorithms into their deterministic discrete-time formulations that characterize their average evolution from a conditional expectation perspective. Analysis based on this method guarantees the convergence of Oja’s rule by selecting some constant learning rate. Oja’s rule almost always converges exponentially to the unit eigenvector associated with the largest eigenvalue of 𝐂, starting from points in an invariant set [18]. A constant learning rate for fast convergence is suggested as 𝜂=0.618/𝜆1 [18].

3. Principal Component Analysis

PCA is based on the spectral analysis of the second-order moment matrix called correlation matrix that statistically characterizes a random vector. In the zero-mean case, this matrix becomes the covariance matrix. In the area of image coding, PCA is known as Karhunen-Loeve transform (KLT) [4], which exploits correlation between neighboring pixels or groups of pixels for data compression.

PCA allows the removal of the second-order correlation among given random processes. By calculating the eigenvectors of the covariance matrix of the input vector, PCA linearly transforms a high-dimensional input vector into a low-dimensional one whose components are uncorrelated. PCA is directly related to singular value decomposition (SVD), and the most common way to perform PCA is via SVD of the data matrix. However, the capability of SVD is limited for very large data sets.

PCA is often derived by optimizing some information criterion, such as the maximization of the variance of the projected data or the minimization of the reconstruction error. The objective of PCA is to extract 𝑚 orthonormal directions 𝐰𝑖∈𝑅^𝑛, 𝑖=1,2,…,𝑚, 𝑚<𝑛, in the input space that account for as much of the data’s variance as possible. Subsequently, an input vector 𝐱∈𝑅^𝑛 may be transformed into an 𝑚-dimensional space without losing essential intrinsic information. The vector 𝐱 can be represented by being projected onto the 𝑚-dimensional subspace spanned by 𝐰𝑖 using the inner products 𝐱𝑇𝐰𝑖, hence achieving dimensionality reduction.

PCA finds those unit directions 𝐰∈𝑅^𝑛 along which the projections of the input vectors, known as the principal components (PCs), 𝑦=𝐱𝑇𝐰, have the largest variance: 𝐸PCA(𝐰)=𝐸[𝑦²]=𝐰𝑇𝐂𝐰=(𝐰̃𝑇𝐂𝐰̃)‖𝐰‖², (7) where 𝐰̃=𝐰/‖𝐰‖. Based on an analysis using the stochastic approximation theory [10, 11], 𝐸PCA(𝐰) takes its maximum value when 𝐰=𝛼𝐜1, where 𝛼 is a scalar. When 𝛼=1, 𝐰 becomes a unit vector.

By repeating maximization of 𝐸PCA(𝐰) but limiting 𝐰 to be orthogonal to 𝐜1, the maximum of 𝐸PCA(𝐰) is equal to 𝜆2 at 𝐰=𝛼𝐜2. Following this deflation procedure, all the 𝑚 principal directions 𝐰𝑖 can be derived [11]. The projections 𝑦𝑖=𝐱𝑇𝐰𝑖, 𝑖=1,2,…,𝑚, are the PCs of 𝐱.

A linear least squares (LS) estimate 𝐱̂ can be constructed for the original input 𝐱 as 𝐱̂𝑡=∑_{𝑖=1}^{𝑚}𝑎𝑖(𝑡)𝐰𝑖. This is a data reconstruction process. The reconstruction error 𝐞 is the difference between the original and reconstructed data: 𝐞=𝐱−𝐱̂=∑_{𝑖=𝑚+1}^{𝑛}𝑎𝑖𝐰𝑖. (8) Naturally, 𝐞 is orthogonal to 𝐱̂. Each principal component 𝑎𝑖 is a Gaussian with zero mean and variance 𝜎𝑖²=𝜆𝑖.
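
As a point of reference for the adaptive algorithms discussed below, the batch solution can be computed directly from the sample covariance matrix; the following NumPy sketch (with illustrative names and dimensions) carries out the projection and the reconstruction of (8) and checks that the residual is orthogonal to the retained subspace.

```python
import numpy as np

def batch_pca(X, m):
    """Batch PCA of zero-mean data X (N x n): principal directions and projections."""
    C = X.T @ X / len(X)                    # sample covariance matrix
    evals, evecs = np.linalg.eigh(C)        # eigenvalues in ascending order
    W = evecs[:, ::-1][:, :m]               # m principal directions w_1, ..., w_m
    return W, X @ W, evals[::-1]            # directions, PCs a_i, eigenvalues

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5)) @ rng.standard_normal((5, 5))
X -= X.mean(axis=0)
W, A, lam = batch_pca(X, m=2)
X_hat = A @ W.T                             # reconstruction from the retained PCs
E = X - X_hat                               # reconstruction error, eq. (8)
print(np.allclose(E @ W, 0.0, atol=1e-8))   # e is orthogonal to the retained subspace
print(E.var(axis=0).sum(), lam[2:].sum())   # residual variance vs. discarded eigenvalues
```

The printed residual variance closely matches the sum of the discarded eigenvalues, consistent with (8).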

4. Hebbian Rule-Based Principal Component Analysis

Neural PCA originates from the seminal work by Oja [14]. For a single neuron, the output is given by 𝑦=𝐰𝑇𝐱, (9) where the weight vector from the input to the neuron is 𝐰=(𝑤1,…,𝑤𝐽1)𝑇.

The single-neuron model was extended to a 𝐽1-𝐽2 feedforward network model to extract the first 𝐽2 PCs [7]. The architecture of the PCA network is shown in Figure 1. The output of the network is given by 𝐲=𝐖𝑇𝐱, (10) where 𝐲=(𝑦1,𝑦2,…,𝑦𝐽2)𝑇, 𝐱=(𝑥1,𝑥2,…,𝑥𝐽1)𝑇, 𝐖=[𝐰1 𝐰2 ⋯ 𝐰𝐽2], and 𝐰𝑖=(𝑤1𝑖,𝑤2𝑖,…,𝑤𝐽1𝑖)𝑇, 𝑤𝑗𝑘 being the weight from the 𝑗th input to the 𝑘th neuron.

4.1. Subspace Learning Algorithms

By using Oja’s rule (6), 𝐰 will converge to a unit eigenvector of 𝐂, and the variance of 𝑦 is maximized. For zero-mean input data, this extracts the first PC [14, 19]. In order to keep the algorithm convergent, 0<𝜂(𝑡)<1/(1.2𝜆1) is required [7]. If 𝜂(𝑡)>1/𝜆1, 𝐰 will not converge to ±𝐜1 even if it is initially close to the target [20]. One can select 𝜂(𝑡)=0.5/(𝐱𝑡𝑇𝐱𝑡) at the beginning and gradually decrease 𝜂 [7].

The symmetrical subspace learning algorithm (SLA) [7] is a learning algorithm for the PCA network. The SLA is based on Oja’s rule and is given by [7] 𝐰𝑖(𝑡+1)=𝐰𝑖(𝑡)+𝜂(𝑡)𝑦𝑖(𝑡)[𝐱𝑡−𝐱̂𝑡], 𝐱̂𝑡=𝐖𝐲. (11) After the algorithm converges, 𝐖 is roughly orthonormal and the columns of 𝐖, namely, 𝐰𝑖, 𝑖=1,…,𝐽2, converge to some linear combination of the first 𝐽2 principal eigenvectors of 𝐂 [7, 21], which is a rotated basis of the dominant eigenvector subspace. This is called principal subspace analysis (PSA). The value of 𝐰𝑖 is dependent on the initial conditions and training samples. The corresponding eigenvalues 𝜆𝑖 approximate 𝐸[𝑦𝑖²] and can be adaptively estimated by 𝜆̂𝑖(𝑡+1)=(1−1/(𝑡+1))𝜆̂𝑖(𝑡)+(1/(𝑡+1))𝑦𝑖²(𝑡+1). (12)
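
A minimal NumPy sketch of the symmetric rule (11) together with the running eigenvalue estimate (12) is given below; the learning rate, epoch count, and initialization are illustrative, and the columns of the returned matrix span the principal subspace rather than the individual eigenvectors.

```python
import numpy as np

def sla(X, m, eta=0.005, epochs=30, seed=0):
    """Symmetrical subspace learning algorithm, eqs. (11)-(12), for zero-mean data."""
    rng = np.random.default_rng(seed)
    W = np.linalg.qr(rng.standard_normal((X.shape[1], m)))[0]  # near-orthonormal start
    lam = np.zeros(m)                        # running estimates of E[y_i^2]
    t = 0
    for _ in range(epochs):
        for x in X:
            t += 1
            y = W.T @ x                      # outputs of the J2 neurons
            x_hat = W @ y                    # shared reconstruction x_hat = W y
            W += eta * np.outer(x - x_hat, y)        # w_i += eta * y_i * (x - x_hat)
            lam += (y**2 - lam) / t          # adaptive eigenvalue estimate, eq. (12)
    return W, lam
```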

Weighted SLA introduces asymmetry into the SLA [22, 23]: 𝐰𝑖(𝑡+1)=𝐰𝑖(𝑡)+𝜂(𝑡)𝑦𝑖(𝑡)[𝐱𝑡−𝛾𝑖𝐱̂𝑡], 𝐱̂𝑡=𝐖𝐲, (13) for 𝑖=1,…,𝐽2, where the coefficients 𝛾𝑖 satisfy 0<𝛾1<𝛾2<⋯<𝛾𝐽2. In the weighted SLA, 𝐰𝑖 almost surely converges to the eigenvectors of 𝐂. The weighted SLA can perform PCA; however, the norms of the weight vectors are not equal to unity.

SLA and weighted SLA are nonlocal algorithms, and they rely on the calculation of the errors and the backward propagation of the values between the layers. A PCA algorithm is obtained by adding a term to SLA [7] so as to rotate the basis vectors in the principal subspace toward the principal eigenvectors [24]. The adaptive learning algorithm (ALA) [20] is also a PCA algorithm that is based on SLA. In ALA, each neuron adaptively updates its learning rate by 𝜂𝑖(𝑡)=𝛽(𝑡)/𝜆̂𝑖(𝑡), where 𝜆̂𝑖(𝑡) is the eigenvalue estimate, which can be computed using (12), and 𝛽(𝑡) is set to be smaller than 2(√2−1) and decreases to zero as 𝑡→∞. All 𝐰𝑖(𝑡) will quickly converge, at nearly the same rate, to 𝐜𝑖 in the order of descending eigenvalues. The performance is better than that of the generalized Hebbian algorithm (GHA) [8]. SLA has been extended in [25] so as to extract a noise-robust projection.

4.2. Generalized Hebbian Algorithm

By combining Oja’s rule and the GSO procedure, Sanger proposed GHA for extracting the first 𝐽2 PCs [8]. GHA can extract the first 𝐽2 eigenvectors in the order of decreasing eigenvalues.

The GHA is given by [8] 𝐰𝑖(𝑡+1)=𝐰𝑖(𝑡)+𝜂𝑖(𝑡)𝑦𝑖(𝑡)[𝐱𝑡−𝐱̂𝑖(𝑡)], (14) 𝐱̂𝑖(𝑡)=∑_{𝑗=1}^{𝑖}𝐰𝑗(𝑡)𝑦𝑗(𝑡), (15) for 𝑖=1,…,𝐽2. GHA becomes a local algorithm by rewriting the summation term in (15) in the recursive form 𝐱̂𝑖(𝑡)=𝐱̂𝑖−1(𝑡)+𝐰𝑖(𝑡)𝑦𝑖(𝑡), 𝑖=1,…,𝐽2, (16) where 𝐱̂0(𝑡)=𝟎. Usually 𝜂𝑖(𝑡) is selected to be the same for all neurons. In GHA, the 𝑚th neuron converges to the 𝑚th PC, and all the neurons tend to converge together: 𝐰𝑖→𝐜𝑖 and 𝐸[𝑦𝑖²]→𝜆𝑖 as 𝑡→∞.
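
The following NumPy sketch of (14)-(16) uses the recursive partial reconstruction so that each neuron sees only its own and the preceding outputs; the learning rate and epoch count are illustrative, and under the stated conditions the columns converge (up to sign) to the eigenvectors in descending eigenvalue order.

```python
import numpy as np

def gha(X, m, eta=0.005, epochs=30, seed=0):
    """Generalized Hebbian algorithm, eqs. (14)-(16), for zero-mean data."""
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((X.shape[1], m))   # columns w_1, ..., w_m
    for _ in range(epochs):
        for x in X:
            y = W.T @ x
            x_hat = np.zeros_like(x)                 # recursion starts from x_hat_0 = 0
            for i in range(m):
                x_hat = x_hat + y[i] * W[:, i]       # recursive form, eq. (16)
                W[:, i] += eta * y[i] * (x - x_hat)  # eq. (14)
    return W
```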

Both SLA [7, 22] and GHA [8] employ implicit or explicit GSO to decorrelate the connection weights. The weighted SLA [22] performs well for extracting less dominant components.

4.3. Other Hebbian Rule-Based Algorithms

In addition to the popular SLA, weighted SLA and GHA algorithm, there are some other Hebbian rule-based PCA algorithms such as local LEAP (learning machine for adaptive feature extraction via principal component analysis) [26], the nonlocal dot-product-decorrelation (DPD) rule [27], and the local invariant-norm PCA [28].

The LEAP algorithm [26] is a local PCA algorithm for extracting all the 𝐽2 PCs and their corresponding eigenvectors. Unlike SLA [7] and GHA [8], whose stability analysis is based on the stochastic approximation theory [6], the stability analysis of LEAP is based on Lyapunov’s first theorem, and as such 𝜂 can be selected as a small positive constant [26]. Due to a constant learning rate, LEAP is capable of tracking nonstationary processes. LEAP can satisfactorily extract PCs even for ill-conditioned autocorrelation matrices [26].

The DPD algorithm is a nonlocal PCA algorithm [27]. It moves 𝐰𝑖, 𝑖=1,,𝐽2, towards the 𝐽2 principal eigenvectors 𝐜𝑖, ordered arbitrarily. The algorithm induces the norms of the weight vectors towards the corresponding eigenvalues, that is, 𝐰𝑖(𝑡)𝜆𝑖(𝑡), as 𝑡. The algorithm breaks the symmetry in its learning process by the difference in the norms of the weight vectors while keeping the symmetry in its structure. The algorithm is as fast as the GHA [8], weighted SLA [22], and least mean squared error reconstruction (LMSER) [29] algorithms.

5. Least Mean Squared Error-Based Principal Component Analysis

Existing PCA algorithms including the Hebbian rule-based algorithms can be derived by optimizing an objective function using the gradient-descent method. The least mean squared error- (LMSE-) based methods are derived from the modified MSE function 𝐸(𝐖)=∑_{𝑡1=1}^{𝑡}𝜇^{𝑡−𝑡1}‖𝐱𝑡1−𝐖𝐖𝑇𝐱𝑡1‖², (17) where 0<𝜇≤1 is a forgetting factor used for nonstationary observation sequences, and 𝑡 is the current instant. Many adaptive PCA algorithms actually optimize (17) by using the gradient-descent method [29, 30] and the RLS method [30–34].

The gradient-descent or Hebbian rule-based algorithms are highly sensitive to 𝜂. RLS-based algorithms such as adaptive principal components extraction (APEX) [35], Kalman-type RLS [31], projection approximation subspace tracking (PAST) [30], PAST with deflation (PASTd) [30], and the robust RLS algorithm (RRLSA) [33] can overcome the drawback. All RLS-based PCA algorithms exhibit fast convergence and high tracking accuracy and are suitable for slowly varying nonstationary vector stochastic processes. However, RLS method may cause instability in certain cases. All these algorithms correspond to a three-layer 𝐽1-𝐽2-𝐽1 linear autoassociative network model, and they can extract all the 𝐽2 PCs in the descending order of the eigenvalues, where a GSO-like orthonormalization procedure is used.

In [36], a regularization term 𝜇𝑡𝐰𝑇𝐏01𝐰 is added to (17), where 𝐰 is a stack vector of 𝐖 and 𝐏0 is a diagonal matrix with dimension 𝐽1𝐽2×𝐽1𝐽2. As 𝑡 is sufficiently large, this term is negligible. This term ensures that the entries of 𝐖 do not become too large. The Gauss-Seidel recursive PCA and Jacobi recursive PCA algorithms are derived in [36].

The LMSER algorithm is derived on the MSE criterion using the gradient-descent method [29]. LMSER reduces to Oja’s SLA algorithm when 𝐖(𝑡) is orthonormal, namely, 𝐖𝑇(𝑡)𝐖(𝑡)=𝐈. Oja’s algorithm can thus be treated as an approximate stochastic gradient rule to minimize the MSE. LMSER [29] has been compared with the weighted SLA [22] and GHA [8] in [37]. LMSER [29] uses nearly twice as much computation as weighted SLA [22] and GHA [8], for each update of the weight. However, it leads to a smaller asymptotic MSE and faster convergence for the minor eigenvectors [37].

PASTd [30] is a well-known subspace tracking algorithm updating the signal eigenvectors and eigenvalues. PASTd is based on PAST. Both PAST and PASTd are derived for complex-valued signals. Both PAST and PASTd have linear computational complexity, that is, 𝑂(𝐽1𝐽2) operations every update, as in the cases of SLA [14], GHA [8], LMSER [29], and novel information criterion (NIC) [32]. PAST computes an arbitrary basis of the signal subspace, while PASTd is able to update the signal eigenvectors and eigenvalues. Both the algorithms produce nearly orthonormal, but not exactly orthonormal, subspace basis or eigenvector estimates. If perfectly orthonormal eigenvector estimates are required, an orthonormalization procedure is necessary.
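
A compact sketch of the PASTd recursion is given below; it follows the usual deflation structure (project, update the exponentially weighted output power, correct the eigenvector estimate, then remove the extracted component), with the forgetting factor, initialization, and function name chosen for illustration only.

```python
import numpy as np

def pastd(X, m, beta=0.97, seed=0):
    """PASTd sketch: deflation-based tracking of m eigenvectors and eigenvalues.

    X is an (N, n) array of real or complex samples; beta is the forgetting
    factor; d holds exponentially weighted output powers (scaled eigenvalues).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = np.linalg.qr(rng.standard_normal((n, m)))[0].astype(X.dtype)
    d = np.ones(m)
    for x in X:
        z = x.copy()
        for i in range(m):
            y = np.vdot(W[:, i], z)               # y_i = w_i^H z
            d[i] = beta * d[i] + abs(y) ** 2      # track the output power
            e = z - W[:, i] * y                   # projection error
            W[:, i] = W[:, i] + e * np.conj(y) / d[i]
            z = z - W[:, i] * y                   # deflation before the next neuron
    return W, d
```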

Kalman-type RLS [31] combines the basic RLS algorithm with the GSO procedure. Kalman-type RLS and PASTd are exactly identical if the inverse of the covariance of the output of the 𝑖th neuron in Kalman-type RLSA takes a special value. In the one-unit case, both PAST and PASTd reduce to Oja’s learning rule [14]. Both PAST and PASTd provide much more robust estimates than eigenvalue decomposition (EVD) and converge much faster than SLA [14]. PASTd has been extended for tracking both the rank and the subspace by using some information theoretic criteria [38].

RRLSA [33] is more robust than PASTd [30]. RRLSA can be implemented in a sequential or parallel form. RRLSA has the same flexibility as Kalman-type RLS [31], PASTd [30], and APEX [35] in that increasing the number of neurons does not affect the previously extracted principal components. RRLSA naturally selects the inverse of the output energy to be the adaptive learning rate for the Hebbian rule. The Hebbian and Oja rules are closely related to the RRLSA algorithm by suitable selection of the learning rates [33]. RRLSA [33] is also robust to the error accumulation from the previous components, which exists in the sequential PCA algorithms like Kalman-type RLS [31] and PASTd [30]. RRLSA converges fast, even if the eigenvalues extend over several orders of magnitude. According to the empirical results [33], RRLSA provides the best performance in terms of the convergence speed as well as the steady-state error, whereas Kalman-type RLS and PASTd have similar performance, which is inferior to that of RRLSA, and ALA [20] exhibits the poorest performance.

6. Other Optimization-Based Principal Component Analysis

PCA can be derived by any optimization method based on a proper objective function. This leads to many other algorithms, including gradient-descent based algorithms [15, 16, 39, 40], the conjugate gradient (CG) method [41], and the quasi-Newton method [42, 43]. The gradient-descent method usually converges to a local minimum. Some adaptive algorithms derived from the gradient descent, conjugate direction, and Newton-Raphson methods, whose simulation results are better than those of the gradient-descent method [29], have also been proposed in [44]. Second-order algorithms such as the CG [41] and quasi-Newton methods [42] typically converge much faster than first-order methods but have a computational complexity of 𝑂(𝐽1²𝐽2) per iteration.

The infomax principle [16] was first proposed by Linsker to describe a neural network algorithm. The principal subspace is derived by maximizing the mutual information criterion.

The NIC algorithm [32] is obtained by applying the gradient-descent method to maximize the NIC, a cost function that is very similar to the mutual information criterion [16, 45] but integrates a soft constraint on the weight orthogonalization. The NIC has a steep landscape along the trajectory from a small weight matrix to the optimum one. 𝐸NIC has a single global maximum, and all the other stationary points are unstable saddle points. At the global maximum, 𝐖 yields an arbitrary orthonormal basis of the principal subspace. The NIC algorithm has a computational complexity of 𝑂(𝐽1²𝐽2) for each iteration.

The NIC algorithm is a PSA method. It can extract the principal eigenvectors when the deflation technique is incorporated. The NIC algorithm converges much faster than SLA [22] and LMSER [29] and is able to globally converge to the PSA solution from almost any weight initialization. Reorthonormalization can be applied so as to perform true PCA [30, 32]. An RLS version of the NIC algorithm is given in [32]. The PAST algorithm [30] is a special case of the NIC algorithm when 𝜂 takes unity. The weighted information criterion (WINC) [34] is obtained by adding to the NIC a weight to break the symmetry in the NIC. Two WINC algorithms are derived by using gradient ascent and RLS, respectively. The gradient ascent-based WINC algorithm can be viewed as a weighted SLA [23] with an adaptive step size, leading to a much faster convergence speed. The RLS-based WINC algorithm not only provides fast convergence and high accuracy but also has low computational complexity.

Most popular PCA or MCA algorithms do not consider eigenvalue estimates in the update equations of the weights, and they suffer from the stability-speed problem. The convergence speed of a system depends on the eigenvalues of its Jacobian. In PCA algorithms, the eigenmotion depends on the principal eigenvalue of the covariance matrix, while in MCA algorithms it depends on all the eigenvalues [46]. Coupled learning rules can be derived by applying the Newton method to a common information criterion. In coupled PCA/MCA algorithms [46], both the eigenvalues and the eigenvectors are simultaneously adapted. The Newton method yields averaged systems with identical speed of convergence in all eigendirections.

In order to extract multiple PCs, one has to apply an orthonormalization procedure, like GSO, or its first-order approximation as used in SLA [7, 22], or deflation as in GHA [8]. In the coupled learning rules, multiple PCs are simultaneously estimated by a coupled system of equations. In the coupled learning rules a first-order approximation of GSO is superior to the standard deflation procedure in terms of the orthonormality error and the quality of the eigenvectors and eigenvalues generated [47].

7. Anti-Hebbian Rule-Based Principal Component Analysis

The anti-Hebbian learning rule updates a synaptic weight by the same amount as that in the Hebbian rule [9] but in the opposite direction. The anti-Hebbian rule can be used to remove correlations between units receiving correlated inputs [13, 48, 49]. The anti-Hebbian rule is inherently stable [13, 49].

Anti-Hebbian rule-based PCA algorithms can be derived by using a 𝐽1-𝐽2 feedforward architecture with lateral connections among the output units [13, 48, 49]. The lateral connections can be in a symmetrical or hierarchical topology. The hierarchical lateral connection topology is illustrated in Figure 2, based on which Rubner-Tavan PCA [13, 49] and APEX [50] algorithms are proposed. In [48], the local PCA algorithm is based on a symmetrical lateral connection topology. In addition to the feedforward weights 𝐖, the lateral weight matrix 𝐔=[𝐮1 ⋯ 𝐮𝐽2] is a 𝐽2×𝐽2 matrix, where 𝐮𝑖=(𝑢1𝑖,𝑢2𝑖,…,𝑢𝐽2𝑖)𝑇 contains all the lateral weights connected to neuron 𝑖 and 𝑢𝑗𝑖 denotes the lateral weight from unit 𝑗 to unit 𝑖.

The Rubner-Tavan PCA algorithm is based on the PCA network with hierarchical lateral connection topology [13, 49]. The algorithm extracts the first 𝐽2 PCs in the decreasing order of the eigenvalues.

The weights 𝐰𝑖 are trained by Oja’s rule, while the lateral weights 𝐮𝑖 are updated by the anti-Hebbian rule. This is a nonlocal algorithm. During the training process, the outputs of the neurons are gradually uncorrelated and the lateral weights approach zero. The network should be trained until the lateral weights 𝐮𝑖 are below a specified level. The PCA algorithm proposed in [48] has the same form as Rubner-Tavan PCA, but 𝐔 is a full matrix.

The APEX algorithm is used to adaptively extract the PCs [50]. The algorithm is recursive and adaptive, namely, given 𝑖1 PCs, it can produce the 𝑖th PC iteratively. The hierarchical structure of lateral connections serves the purpose of weight orthogonalization and also allows the network to grow or shrink without retraining the old units. APEX is proved to have the property of exponential convergence [50].

Assuming that the correlation matrix 𝐂 has distinct eigenvalues arranged in the decreasing order 𝜆1>𝜆2>⋯>𝜆𝐽2 with the corresponding eigenvectors 𝐰1,…,𝐰𝐽2, the algorithm is given as [35, 50] 𝐲=𝐖𝑇𝐱, 𝑦𝑖=𝐰𝑖𝑇𝐱+𝐮𝑇𝐲, (18) where 𝐲=(𝑦1,…,𝑦𝑖−1)𝑇 is the output vector, 𝐮=(𝑢1𝑖,𝑢2𝑖,…,𝑢(𝑖−1)𝑖)𝑇, and 𝐖=[𝐰1 ⋯ 𝐰𝑖−1] is the weight matrix of the first 𝑖−1 neurons. The iteration is given as [35, 50] 𝐰𝑖(𝑡+1)=𝐰𝑖(𝑡)+𝜂𝑖(𝑡)[𝑦𝑖(𝑡)𝐱𝑡−𝑦𝑖²(𝑡)𝐰𝑖(𝑡)], (19) 𝐮(𝑡+1)=𝐮(𝑡)−𝜂𝑖(𝑡)[𝑦𝑖(𝑡)𝐲(𝑡)+𝑦𝑖²(𝑡)𝐮(𝑡)]. (20) Equation (19) is the Hebbian part, and (20) is the anti-Hebbian part. 𝑦𝑖 tends to be orthogonal to all the previous components due to the anti-Hebbian rule, which is also called the orthogonalization rule. APEX can also be derived from the RLS method using the MSE criterion. Based on the RLS method, an optimum learning rate in terms of convergence speed is given by 𝜂𝑖(𝑡)=(1−𝜇)/𝜆𝑖, where 0<𝜇≤1 is a forgetting factor [35].
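
A sequential NumPy sketch of APEX follows; each new neuron is trained with the Hebbian update (19) for its feedforward weights and the anti-Hebbian update (20) for its lateral weights, while the previously trained neurons are kept fixed. The learning rate, epoch count, and initialization are illustrative.

```python
import numpy as np

def apex(X, m, eta=0.01, epochs=40, seed=0):
    """APEX sketch: neurons are added one at a time; lateral weights decay to zero."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = np.zeros((n, m))
    for i in range(m):
        w = 0.1 * rng.standard_normal(n)
        u = np.zeros(i)                        # lateral weights from neurons 1..i-1
        for _ in range(epochs):
            for x in X:
                y_prev = W[:, :i].T @ x        # outputs of the trained neurons
                y = w @ x + u @ y_prev         # eq. (18)
                w += eta * (y * x - y * y * w)           # Hebbian part, eq. (19)
                u -= eta * (y * y_prev + y * y * u)      # anti-Hebbian part, eq. (20)
        W[:, i] = w
    return W
```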

A desirable number of neurons can be decided during the learning process. When the environment is changing over time, a new PC can be added to compensate for the change without affecting the previously computed PCs, and the network structure can be expanded if necessary. 𝐰𝑖 converges to the eigenvector of the correlation matrix 𝐂 corresponding to the 𝑖th largest eigenvalue, and 𝐮 converges to zero.

For growing each additional PC, APEX requires 𝑂(𝐽1) per iteration, while GHA requires 𝑂(𝐽1𝐽2) per iteration.

A class of learning algorithms, called 𝜓-APEX, is presented based on a criterion optimization [51, 52]. 𝜓 can be selected as an arbitrary function that guarantees the stability of the network. Some members in the class have better numerical performance and require less computational effort compared to that of both GHA and APEX.

Most existing linear complexity methods including GHA [8], SLA [7], and PCA with the lateral connections [13, 35, 48–50] require a computational complexity of 𝑂(𝐽1𝐽2) per iteration.

8. Nonlinear Principal Component Analysis

PCA is based on the Gaussian assumption for data distribution, and the optimality of PCA results from taking into account only the second-order statistics. For non-Gaussian data distributions, PCA is not able to capture complex nonlinear correlations, and nonlinear processing of the data is usually more efficient. Nonlinearities introduce higher-order statistics into the computation in an implicit way. Higher-order statistics, defined by cumulants, are needed for a good characterization of non-Gaussian data. PCA can be generalized to distributions of the exponential family [53].

When the feature space is nonlinearly related to the input space, we need to use nonlinear PCA. The outputs of nonlinear PCA networks are usually more independent than their respective linear cases. For non-Gaussian input data, a nonlinear PCA permits the extraction of higher-order components and provides a sufficient representation.

Kernel PCA [3, 54] is a special, linear algebra-based nonlinear PCA, which introduces kernel functions into PCA. The kernel PCA first maps the original input data into a high-dimensional feature space using the kernel method and then calculates PCA in the high-dimensional feature space. It is much more complicated and may sometimes be caught more easily in local minima. PCA needs to deal with an eigenvalue problem of a 𝐽1×𝐽1 matrix, while kernel PCA needs to solve an eigenvalue problem of an 𝑁×𝑁 matrix. Sparse approximation methods can be applied to reduce the computational cost [3].
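
The following sketch illustrates the kernel PCA procedure described above, using an RBF kernel as an example; the kernel choice, its width parameter, and the function name are assumptions made for illustration. The eigenproblem is solved on the centered 𝑁×𝑁 Gram matrix rather than on the 𝐽1×𝐽1 covariance matrix.

```python
import numpy as np

def kernel_pca(X, m, gamma=1.0):
    """Kernel PCA sketch with an RBF kernel: solve the N x N eigenproblem of the
    centered Gram matrix and return the nonlinear PCs of the training data."""
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))
    N = len(X)
    J = np.full((N, N), 1.0 / N)
    Kc = K - J @ K - K @ J + J @ K @ J          # centering in feature space
    evals, evecs = np.linalg.eigh(Kc)
    evals, evecs = evals[::-1][:m], evecs[:, ::-1][:, :m]
    alphas = evecs / np.sqrt(np.maximum(evals, 1e-12))   # normalize expansion coefficients
    return Kc @ alphas                                    # projections of the training samples
```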

In order to increase the robustness of PCA against outliers, a robust version of the covariance matrix based on the 𝑀-estimator can be used. Several popular PCA algorithms have been generalized into robust versions by applying statistical physics approach [55], where the defined objective function can be regarded as a soft generalization of an 𝑀-estimator [56]. Robust PCA can be defined so that the optimization criterion grows less than quadratically with the same constraint conditions as those for PCA, which are based on the quadratic criterion [57]. This usually leads to mildly nonlinear algorithms, in which the nonlinearities appear at selected places only and at least one neuron produces the linear response 𝑦𝑖=𝐱𝑇𝐰𝑖. When all the neurons generate nonlinear responses 𝑦𝑖=𝜙(𝐱𝑇𝐰𝑖), the algorithm is referred to as nonlinear PCA. The robust or nonlinear PCA algorithms are derived by using the gradient-descent method [57]. They can be treated as generalization of SLA [7, 22] and the GHA [8]. Robust and nonlinear PCA algorithms have better stability properties than the corresponding PCA algorithms if the (odd) nonlinearity 𝜑(𝑥) grows less than linearly, namely, |𝜑(𝑥)|<|𝑥| [57]. On the contrary, nonlinearities growing faster than linearly cause stability problems easily and are not recommended. These extensions are also introduced in [29, 58, 59].

The multilayer perceptron (MLP) can be used to perform nonlinear dimensionality reduction and hence nonlinear PCA. Both the input and output layers of the MLP have 𝐽1 units, and one of its hidden layers, known as the bottleneck or representation layer, has 𝐽2 units, 𝐽2<𝐽1. The network is trained to reproduce its input vectors at its output. Such a network is called an autoassociative network. After the network is trained, it performs a projection onto the 𝐽2-dimensional subspace spanned by the first 𝐽2 PCs of the data. The vectors of weights leading to the hidden units form a basis set which spans the principal subspace, and data compression therefore occurs in the bottleneck layer. Many applications of the autoassociative MLP for PCA are available in the literature [60–63].

Kramer’s nonlinear PCA network [62] is a five-layer autoassociative MLP. It has 𝐽1 input and 𝐽1 output nodes. The third layer has 𝐽2 nodes. Nonlinear activation functions such as the sigmoidal functions are used in the second and fourth layers, while the nodes in the bottleneck and output layers usually have linear activation functions, although they can be nonlinear. The network is trained by the backpropagation (BP) algorithm [10, 64]. Kramer’s nonlinear PCA fits a lower-dimensional surface through the training data. Usually, the data compression achieved in the bottleneck layer in such networks is somewhat better than that provided by the PCA solution [65]. However, the BP algorithm is prone to local minima and often requires excessive time for convergence.

A hierarchical nonlinear PCA network composed of a number of independent subnetworks can extract ordered nonlinear PCs [66]. Each subnetwork extracts one PC and has at least five layers. The subnetworks can be selected as Kramer’s nonlinear PCA network and are hierarchically arranged and trained. This network constructs the extraction functions in the order of the reconstruction efficiency as to the objective data. The number of PCs to be extracted is not required to be known in advance.

A hybrid hetero/autoassociative network [67] is constructed with a set of autoassociative outputs and a set of heteroassociative outputs. Both sets of output nodes are fully connected to the same bottleneck layer. The improvement over an autoassociative network or the PCA can be attributed to the reorganization done by the network in the representation layer space. The self-organizing map (SOM) [68] is a competitive learning-based neural network. It is capable of performing dimensionality reduction on the input. The SOM is inherently nonlinear and is viewed as a nonlinear PCA [69]. The adaptive subspace SOM (ASSOM) [70, 71] can be treated as a hybrid of vector quantization (VQ) and PCA.

9. Minor Component Analysis

MCA, as a variant of PCA, aims to find the smallest eigenvalues and their corresponding eigenvectors of the autocorrelation matrix 𝐂 of the signals. MCA is closely associated with curve and surface fitting under the total least squares (TLS) criterion [72]. MCA provides an alternative solution to the TLS problem [73]. The TLS technique achieves a better global optimal objective than the LS technique [1]. Both the TLS and LS problems can be solved by SVD. However, the TLS technique is computationally much more expensive than the LS technique [74]. MCA is useful in many fields including spectrum estimation, optimization, TLS parameter estimation in adaptive signal processing, and eigen-based bearing estimation.
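
As a concrete illustration of the MCA-TLS connection, the short script below fits a line to data with noise in both coordinates by taking the minor eigenvector of the centered data covariance; this is a batch stand-in for the quantity the adaptive MCA algorithms below estimate online, and all numerical values in it are illustrative.

```python
import numpy as np

# TLS line fit y ~ a*x + b with noise in both coordinates: the line normal is the
# minor eigenvector of the centered data covariance, the quantity MCA estimates.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200) + 0.3 * rng.standard_normal(200)
y = 2.0 * np.linspace(0.0, 10.0, 200) + 1.0 + 0.3 * rng.standard_normal(200)

Z = np.column_stack([x - x.mean(), y - y.mean()])
C = Z.T @ Z / len(Z)
w = np.linalg.eigh(C)[1][:, 0]          # minor component (smallest eigenvalue)
a_tls = -w[0] / w[1]                    # slope recovered from the line normal
b_tls = y.mean() - a_tls * x.mean()
print(a_tls, b_tls)                     # approximately 2.0 and 1.0
```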

The anti-Hebbian learning rule and its normalized version can be used for MCA [75], but both may lead to infinite magnitudes of weights [76]. To avoid this, one can renormalize the weight vector at each iteration. The constrained anti-Hebbian learning algorithm [73, 77] has a simple structure and requires a low computational complexity per update. However, the convergence of the magnitudes of the weights cannot be guaranteed either unless the initial weights take special values. The total least mean squares (TLMS) algorithm [74] is a random adaptive algorithm for extracting the MC, which has an equilibrium point under persistent excitation conditions. The TLMS requires about 4𝐽1 multiplications per iteration, which is twice the complexity of the LMS [78]. An adaptive step-size learning algorithm [79] has been derived for extracting the MC by introducing information criterion. The algorithm globally converges asymptotically to the MC and its corresponding eigenvector. The algorithm outperforms the TLMS [74] in terms of both the convergence speed and the estimation accuracy.

Minor components (MCs) can be extracted in ways similar to that for PCs. A simple idea is to reverse the sign of the PCA algorithms. This is because in many algorithms PCs correspond to the maximum of a cost function, while MCs correspond to the minimum of the same cost function. However, this idea does not work in general and has been discussed in [22]. Oja’s minor subspace analysis (MSA) algorithm can be formulated by reversing the sign of the learning rate of the SLA. This algorithm requires the assumption that the smallest eigenvalue of the autocorrelation matrix 𝐂 is less than unity. However, Oja’s MSA algorithm is known to diverge [22, 80, 81]. The bigradient PSA algorithm [82] is a modification to SLA [7] and is obtained by introducing an additional bigradient term embodying the orthonormal constraints of the weights, and it can be used for MSA by reversing the sign of 𝜂.

A general algorithm that can extract, in parallel, principal and minor eigenvectors of arbitrary dimensions is derived based on the natural-gradient method [83]. The difference between PCA and MCA lies in the sign of the learning rate. The MCA algorithm proposed in [83] suffers from a marginal instability, and thus it requires intermittent normalization such that ‖𝐰𝑖‖=1 [80]. A self-stabilizing MCA algorithm has been proposed in [80], such that none of the ‖𝐰𝑖(𝑡)‖ deviates significantly from unity. It diverges for PCA when −𝜂 is changed to +𝜂.

The orthogonal Oja (OOja) algorithm consists of Oja’s MSA [7] plus an orthogonalization of 𝐖(𝑡) at each iteration [84], 𝐖𝑇(𝑡)𝐖(𝑡)=𝐈. In this case, the above algorithms given in [7, 80, 83] are equivalent. The OOja is numerically very stable. By reversing the sign of 𝜂, we extract 𝐽2 PCs. The normalized Oja (NOja) is derived by optimizing the MSE criterion subject to an approximation to the orthonormal constraint [81]. This leads to the optimal learning rate. The normalized orthogonal Oja (NOOja) is an orthogonal version of the NOja such that the orthonormal constraint is perfectly satisfied [81]. Both algorithms offer, as compared to Oja’s SLA, a faster convergence, orthogonality, and a better numerical stability with a slight increase in the computational complexity. By switching the sign of 𝜂 in given learning algorithms, both the NOja and the NOOja can be used for the estimation of minor and principal subspaces of a vector sequence.

The above algorithms, including Oja’s MSA [7], the natural-gradient-based method [83], self-stabilizing MCA [80], OOja, NOja, and NOOja, have a complexity of 𝑂(𝐽1𝐽2) [80, 84]. OOja, NOja, and NOOja require a lower computational load than the natural-gradient-based method [83] and self-stabilizing MCA [80, 81, 84].

By using the Rayleigh quotient as an energy function, invariant-norm MCA [85] is analytically proved to converge to the first MC of the input signals. However, invariant-norm MCA [85] leads to divergence in finite time [76], and this drawback can be eliminated by renormalizing the weight vector at each iteration. In [86], an MCA algorithm for extracting multiple MCs is described by using the idea of sequential addition; a conversion method between MCA and PCA is also discussed. Based on a generalized differential equation for the generalized eigenvalue problem, a class of algorithms can be obtained for extracting the first PC or MC by selecting different parameters and functions [87]. Many existing PCA and MCA algorithms are special cases of this class. All the algorithms of this class have the same order of convergence speed and are robust to implementation error. A rapidly convergent quasi-Newton method has been applied to extract multiple MCs in [88]. The proposed algorithm has a complexity of 𝑂(𝐽21𝐽2) but with a quadratic convergence. It makes use of the implicit orthogonalization procedure that is built into it through an inflation technique.

10. Localized Principal Component Analysis

The nonlinear PCA problem can be solved by partitioning the data space into a number of disjoint regions and then estimating the principal subspace within each partition by linear PCA. This is the so-called localized PCA. The distribution is collectively modeled by a collection or a mixture of linear PCA models, each characterizing a partition. Most natural data sets have large eigenvalues in only a few eigendirections, while the variances in other eigendirections are so small as to be considered as noise. The localized PCA method provides an efficient means to decompose high-dimensional data compression problems into low-dimensional data compression problems. The localized PCA method is commonly used in image compression [8]. An image is often first transform coded by PCA, and then the coefficients are quantized.

VQ-PCA [65] is a locally linear model that uses VQ to define the Voronoi regions for localized PCA. The algorithm builds a piecewise linear model of the data. It performs better than the global models implemented by PCA model and Kramer’s nonlinear PCA and is significantly faster than Kramer’s nonlinear PCA [65]. Adaptive combination of PCA and VQ is given in [89], where an autoassociative network is used to perform PCA and simple competitive learning is used to perform VQ. The error between the input and output of the autoassociative network is fed to the VQ network. The network produces better results than by using the two algorithms successively. An online localized PCA algorithm [90] is developed by extending the neural gas method [91]. Instead of the Euclidean distance measure, a combination of a normalized Mahalanobis distance and the squared reconstruction error guides the competition between the units. Weighting between the two measures is determined from the residual variance in the minor subspace of each submodel. The unit centers are updated as in neural gas, while subspace learning is based on the RRLSA algorithm [33].

Similar to localized PCA, localized ICA is used to characterize nonlinear ICA. Clustering is first used for an overall coarse nonlinear representation of the underlying data and linear ICA is then applied in each cluster so as to describe local features of the data [92]. This leads to a better representation of the data than in linear ICA. ASSOM [93] is another localized PCA for unsupervised extraction of invariant local features from the input data. ASSOM associates a subspace instead of a single weight vector to each node of the SOM. The subspaces in ASSOM can be formed by applying ICA [94].

11. Extending to Complex Domain

Complex PCA is a generalization of PCA to complex-valued data sets [95]. Complex PCA has been widely applied to complex-valued data and two-dimensional vector fields. Complex PCA employs the same neural network architecture as that of PCA, but with complex weights. The objective functions for PCA can also be adapted to complex PCA by changing the transpose into the Hermitian transpose. For example, for complex PCA, complex PCs can be extracted by minimizing the MSE function 𝐸=(1/𝑁)∑_{𝑖=1}^{𝑁}‖𝐳𝑖−𝐖𝐖𝐻𝐳𝑖‖², (21) where 𝐳𝑖 is the 𝑖th input complex vector.
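
A minimal batch sketch of complex PCA follows: the Hermitian sample covariance is eigendecomposed and the data are projected and reconstructed as in (21); the data-generation step and the dimensions are purely illustrative.

```python
import numpy as np

def complex_pca(Z, m):
    """Complex PCA sketch: EVD of the Hermitian sample covariance of the rows of Z."""
    C = Z.T @ Z.conj() / len(Z)             # estimate of E[z z^H]
    evals, evecs = np.linalg.eigh(C)        # real eigenvalues, complex eigenvectors
    W = evecs[:, ::-1][:, :m]
    Y = Z @ W.conj()                        # complex principal components y = W^H z
    return W, Y

rng = np.random.default_rng(0)
Z = (rng.standard_normal((500, 4)) + 1j * rng.standard_normal((500, 4))) @ \
    (rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4)))
W, Y = complex_pca(Z, m=2)
Z_hat = Y @ W.T                             # reconstruction W W^H z, as in (21)
print(np.linalg.norm(Z - Z_hat) / np.linalg.norm(Z))   # relative reconstruction error
```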

In [96], a complex-valued neural network model is developed for nonlinear complex PCA. Nonlinear complex PCA has the ability to extract nonlinear features missed by PCA. It uses the architecture of Kramer’s nonlinear PCA network [62], but with complex weights and biases. For a similar number of model parameters, the nonlinear complex PCA model captures more variance of a data set than the alternative real approach, where each complex variable is replaced by two real variables and is applied to Kramer’s nonlinear PCA. The complex nonlinear transfer function is selected as tanh(𝑧) with |𝑧|<𝜋/2 [97]. To limit the modulus of the net input of a neuron |𝐰𝐻𝐳| to be less than 𝜋/2, one can initialize the algorithm with weights and biases of small magnitudes and use a weight penalty in the objective function. A complex-valued BP or quasi-Newton algorithm can be used for training.

There are many other complex PCA algorithms. Both PAST and PASTd are, respectively, the PSA and PCA algorithms derived for complex-valued signals [30]. A heuristic complex extension of GHA [8] and APEX [35] is, respectively, given in [98, 99]. The robust complex PCA algorithms have also been derived in [100] for hierarchically extracting PCs of complex-valued signals based on a robust statistics-based loss function. Concerning complex MCA, the constrained anti-Hebbian algorithm [73, 77] has been extended for the complex-valued TLS problem [77] and has been applied to adaptive FIR and IIR filtering. The adaptive invariant-norm MCA algorithm [85] has been generalized to the case for complex-valued input signal vector 𝐱(𝑡). For ICA algorithms, FastICA has been applied to complex signals [101]. The 𝜓-APEX algorithms and GHA are, respectively, extended to the complex-valued case [102, 103]. Based on a suitably selected nonlinear function, these algorithms can be used for BSS of complex-valued circular source signals.

12. Other Generalizations of PCA

Simple neural models, described by differential equations, are derived in [104, 105] to calculate the largest and smallest eigenvalues as well as their corresponding eigenvectors of any real symmetric matrix. Supervised PCA [106, 107] is achieved by augmenting the input of the PCA with the class label of the data set. Given a sample covariance matrix, sparse PCA [108] considers the problem of maximizing the variance explained by a linear combination of the input variables while constraining the number of nonzero coefficients in this combination. Constrained PCA, generalized EVD, and the two-dimensional PCA are three important generalizations of PCA.

12.1. Constrained Principal Component Analysis

When certain subspaces are less preferred than others, this yields constrained PCA [109]. The optimality criterion for constrained PCA is variance maximization, as in PCA, but with an external subspace orthogonality constraint that extracted PCs are orthogonal to some undesired subspaces. PCA usually obtains the best fixed-rank approximation to the data in the LS sense. On the other hand, constrained PCA allows specifying metric matrices that modulate the effects of rows and columns of a data matrix. This actually is the weighted LS estimation. Constrained PCA first decomposes the data matrix by projecting the data matrix onto the spaces spanned by matrices of external information and then applies PCA to decomposed matrices, which involves the generalized SVD. The APEX algorithm has been applied to recursively solve the constrained PCA problem [35].

The constrained PAST algorithm [110] tracks the signal subspace recursively. Based on an interpretation of the signal subspace as the solution of a constrained minimization task, it guarantees the orthonormality of the estimated signal subspace basis at each update, hence avoiding a separate orthonormalization process. To reduce the computational complexity, a fast constrained PAST algorithm is introduced, which has 𝑂(𝐽1𝐽2) complexity. A signal subspace rank estimator is employed to track the number of sources.

12.2. Generalized Eigenvalue Decomposition

Generalized EVD is a statistical tool extremely useful in feature extraction, pattern recognition, as well as signal estimation and detection. The generalized EVD problem involves the matrix equation 𝐑1𝐰𝑖=𝜆𝑖𝐑2𝐰𝑖, 𝑖=1,…,𝐽2, (22) where 𝐑1, 𝐑2∈𝑅^{𝐽1×𝐽1}, and 𝜆𝑖, 𝐰𝑖 are, respectively, the 𝑖th generalized eigenvalue and its corresponding generalized eigenvector. For real symmetric and positive definite matrices, all the generalized eigenvectors are real and the corresponding generalized eigenvalues are positive.

Generalized EVD achieves simultaneous diagonalization of 𝐑1 and 𝐑2: 𝐖𝑇𝐑1𝐖=Λ, 𝐖𝑇𝐑2𝐖=𝐈, (23) where 𝐖=[𝐰1,…,𝐰𝐽2] and Λ=diag(𝜆1,…,𝜆𝐽2). Typically, 𝐑1 and 𝐑2 are, respectively, the full covariance matrices of zero-mean stationary random signals 𝐱1, 𝐱2∈𝑅^{𝐽1}. In this case, iterative generalized EVD algorithms can be obtained by using two PCA steps. When 𝐑2 becomes an identity matrix, the generalized EVD reduces to PCA.
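
For a batch reference point, the simultaneous diagonalization (23) can be checked directly with a standard generalized symmetric eigensolver; the sketch below uses SciPy on covariance matrices built from illustrative random data.

```python
import numpy as np
from scipy.linalg import eigh

# Generalized EVD of (22): R1 w = lambda R2 w, with R1 and R2 symmetric
# positive-definite covariance matrices built from illustrative random data.
rng = np.random.default_rng(0)
X1 = rng.standard_normal((500, 4)) @ rng.standard_normal((4, 4))
X2 = rng.standard_normal((500, 4)) @ rng.standard_normal((4, 4))
R1, R2 = X1.T @ X1 / len(X1), X2.T @ X2 / len(X2)

lam, W = eigh(R1, R2)                       # generalized eigenvalues and eigenvectors
print(np.allclose(W.T @ R2 @ W, np.eye(4), atol=1e-8))     # W^T R2 W = I, eq. (23)
print(np.allclose(W.T @ R1 @ W, np.diag(lam), atol=1e-8))  # W^T R1 W = Lambda, eq. (23)
```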

Any generalized eigenvector 𝐰𝑖 is a stationary point of the criterion function 𝐸GEVD(𝐰)=(𝐰𝑇𝐑1𝐰)/(𝐰𝑇𝐑2𝐰). (24) The LDA problem is a typical generalized EVD problem. The three-layer LDA network [111] is obtained by concatenating two Rubner-Tavan PCA subnetworks, each being trained by the Rubner-Tavan PCA algorithm [13, 49]. Based on the Rubner-Tavan PCA network architecture, online local learning algorithms for LDA and generalized EVD are given in [112]. There are also a number of adaptive methods for generalized EVD such as LDA-based gradient descent [113, 114], a quasi-Newton type generalized EVD [115], an RLS-like fixed-point generalized EVD algorithm [116], error-correction learning [112], and Hebbian learning [112]. All these algorithms first extract the principal generalized eigenvector and then estimate the minor generalized eigenvectors using a deflation procedure.

A recurrent network with invariant 𝐵-norm [117] computes the largest or smallest generalized eigenvalue and the corresponding eigenvector of any symmetric positive pair, which can be simply extended to compute the second largest or smallest generalized eigenvalue and the corresponding eigenvector. In [118], the proposed unconstrained quartic cost function based on the weighted rule has a unique global minimum, which corresponds to the principal generalized eigenvectors.

12.3. Two-Dimensional PCA

Two-dimensional PCA [5] is especially designed for image representation. An image covariance matrix is constructed directly using the original image matrices instead of the transformed vectors, and its eigenvectors are derived for image feature extraction. For 𝑚×𝑛 images, the size of the image covariance (scatter) matrix using two-dimensional PCA is 𝑛×𝑛, whereas, for PCA, the size is 𝑚𝑛×𝑚𝑛. This results in a considerable computational advantage for two-dimensional PCA. Two-dimensional PCA evaluates the covariance matrix more accurately than PCA. When used for face recognition, two-dimensional PCA results in a better recognition accuracy. Two-dimensional PCA is a row-based PCA, and it only reflects the information between rows. Diagonal PCA [119] improves two-dimensional PCA by defining the image scatter matrix as the covariances between the variations of the rows and those of the columns of the images and is shown to be more accurate than PCA and two-dimensional PCA. 𝐿1-norm-based two-dimensional PCA [120] is an iterative two-dimensional generalization of 𝐿1-norm-based PCA.
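
A short NumPy sketch of the two-dimensional PCA construction follows: the 𝑛×𝑛 image covariance (scatter) matrix is accumulated from the image matrices themselves and each image is projected row-wise onto its leading eigenvectors; the image sizes and the number of components are illustrative.

```python
import numpy as np

def two_dimensional_pca(images, m):
    """2DPCA sketch: project each image row-wise onto the leading eigenvectors of
    the n x n image covariance (scatter) matrix built from the image matrices."""
    mean = images.mean(axis=0)
    n = images.shape[2]
    G = np.zeros((n, n))
    for A in images:
        D = A - mean
        G += D.T @ D
    G /= len(images)
    evecs = np.linalg.eigh(G)[1][:, ::-1][:, :m]   # m projection axes (n x m)
    return evecs, images @ evecs                   # feature matrices, one per image

imgs = np.random.default_rng(0).standard_normal((100, 20, 16))  # illustrative "images"
X, Y = two_dimensional_pca(imgs, m=4)
print(Y.shape)   # (100, 20, 4): each 20 x 16 image becomes a 20 x 4 feature matrix
```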

Bidirectional PCA [121] reduces the dimension in both column and row directions for image feature extraction. The feature dimension obtained is much less than that of two-dimensional PCA. Two-dimensional PCA can be regarded as a special bidirectional PCA. Bidirectional PCA has to be performed in batch mode. With the concepts of tensor, 𝑘-mode unfolding and matricization, an SVD-revision-based incremental learning method of bidirectional PCA [122] gives a close approximation to bidirectional PCA, but using less time.

The uncorrelated multilinear PCA algorithm [123] is used for unsupervised subspace learning of tensorial data. It is a multilinear extension of PCA. Through successive variance maximization, uncorrelated multilinear PCA seeks a tensor-to-vector projection that captures most of the variation in the original tensorial input while producing uncorrelated features. It is the only multilinear extension of PCA that can produce uncorrelated features in a fashion similar to that of PCA, in contrast to other multilinear PCA extensions, such as two-dimensional PCA [5] and multilinear PCA (MPCA) [124].

13. Singular Value Decomposition

Given two sets of random vectors with zero mean, {𝐱𝑡∈𝑅^{𝑛1}} and {𝐲𝑡∈𝑅^{𝑛2}}, the cross-correlation matrix is defined by 𝐂𝑥𝑦=𝐸[𝐱𝑡𝐲𝑡𝑇]=∑_{𝑖=1}^{𝑛}𝜎𝑖𝐯𝑥𝑖𝐯𝑦𝑖𝑇, (25) where 𝜎𝑖>0 is the 𝑖th singular value, 𝐯𝑥𝑖 and 𝐯𝑦𝑖 are its corresponding left and right singular vectors, and 𝑛=min{𝑛1,𝑛2}. The cross-correlation asymmetric PCA/MCA networks can be used to extract the singular values of the cross-correlation matrix of two stochastic signal vectors or to implement SVD of a general matrix.
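
For orientation, the singular structure in (25) can be estimated in batch form from samples of the two signal vectors; the sketch below builds two views from a common latent signal (an illustrative construction) and applies a standard SVD to the sample cross-correlation matrix, which is the quantity the asymmetric networks below track adaptively.

```python
import numpy as np

# Batch estimate of the singular structure of C_xy in (25) from samples of two
# signal vectors that share a low-dimensional latent signal (illustrative setup).
rng = np.random.default_rng(0)
N = 5000
s = rng.standard_normal((N, 3))                          # common latent signals
x = s @ rng.standard_normal((3, 6)) + 0.1 * rng.standard_normal((N, 6))
y = s @ rng.standard_normal((3, 4)) + 0.1 * rng.standard_normal((N, 4))

C_xy = x.T @ y / N                                       # sample estimate of E[x y^T]
U, sigma, Vt = np.linalg.svd(C_xy)                       # left/right singular vectors
print(sigma)                                             # about three dominant values
```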

The cross-correlation asymmetric PCA (APCA) network consists of two sets of neurons that are laterally hierarchically connected [125]. The topology of the network is shown in Figure 3. 𝐱 and 𝐲 are, respectively, the 𝑛1-dimensional and 𝑛2-dimensional input signals, the 𝑛1×𝑚 matrix 𝐖_=[𝐰_1 ⋯ 𝐰_𝑚] and the 𝑛2×𝑚 matrix 𝐖=[𝐰1 ⋯ 𝐰𝑚] are the feedforward weights, while the 𝑛2×𝑚 matrices 𝐔_=[𝐮_1 ⋯ 𝐮_𝑚] and 𝐔=[𝐮1 ⋯ 𝐮𝑚] are the lateral connection weights, where 𝐮_𝑖=(𝑢_1𝑖,…,𝑢_𝑚𝑖)𝑇, 𝐮𝑖=(𝑢1𝑖,…,𝑢𝑚𝑖)𝑇, and 𝑚≤min{𝑛1,𝑛2}. This model performs SVD of 𝐂𝑥𝑦 [125].

The network has the following relations: 𝐚=𝐖_𝑇𝐱, 𝐛=𝐖𝑇𝐲, (26) where 𝐚=(𝑎1,…,𝑎𝑚)𝑇 and 𝐛=(𝑏1,…,𝑏𝑚)𝑇.

The objective function for extracting the first principal singular value of the cross-correlation matrix is given by 𝐸APCA(𝐰_,𝐰)=𝐸[𝑎1(𝑡)𝑏1(𝑡)]/(‖𝐰_‖‖𝐰‖)=𝐰_𝑇𝐂𝑥𝑦𝐰/(‖𝐰_‖‖𝐰‖). (27) It is an indefinite function. When 𝐲=𝐱, it reduces to PCA [14]. After the principal singular component is extracted, a deflation transformation is introduced to nullify the principal singular value so as to make the next singular value principal. Thus, 𝐂𝑥𝑦 in the criterion (27) can be replaced by a transformed form so as to extract the next principal singular component.

Using a deflation transformation, the two sets of neurons are trained with the cross-coupled Hebbian learning rules, which are given in [125, 126]. 𝐰_𝑖 and 𝐰𝑖 approximate the 𝑖th left and right principal singular vectors of 𝐂𝑥𝑦, respectively, and 𝜎𝑖 approximates its corresponding criterion 𝐸APCA, as 𝑡→∞. The algorithm extracts the first 𝑚 principal singular values in the descending order and their corresponding left and right singular vectors. Like APEX, the APCA algorithm incrementally adds nodes without retraining the learned nodes. Exponential convergence has been demonstrated by simulation [125].

Based on the APCA network, the principal singular component of 𝐂𝑥𝑦 can be efficiently extracted by using a modification to the cross-coupled Hebbian rule with global asymptotic convergence [127]. This algorithm is extended for extracting the principal singular component of a general matrix 𝐀𝑅𝑛1×𝑛2 by replacing the cross-correlation matrix by a general nonsquare matrix [127]. Based on this and a deflation transformation, one can extract multiple principal singular components for the nonsquare matrix 𝐀 [127]. The algorithm can efficiently perform SVD of an ill-posed matrix. It can be used to solve the smallest singular component of the general matrix 𝐀 and is especially useful for TLS problems. Some adaptive SVD algorithms for subspace tracking of a recursively updated data matrix have been surveyed and proposed in [128].

Coupled learning rules for SVD produce better results than Hebbian learning rules. Combined with first-order approximation of GSO, precise estimates of singular vectors and singular values with only small deviations from orthonormality are produced. Double deflation is clearly superior to standard deflation but inferior to first-order approximation of GSO, both with respect to orthonormality and diagonalization errors. Coupled learning rules converge faster than Hebbian learning rules, and the first-order approximation of GSO produces more precise estimates and better orthonormality than standard deflation [129]. Many SVD algorithms are reviewed in [129].

Tucker decomposition [130] decomposes a three-dimensional signal directly using three-dimensional PCA, which is a multilinear generalization of SVD to multidimensional data. For video frames, this higher-order SVD decomposes the dynamic texture as a multidimensional signal (tensor) without unfolding the video frames on column vectors. This is a more natural and flexible decomposition, since it permits us to perform dimension reduction in the spatial, temporal, and chromatic domains between the pixels of the video sequence, leading to an important decrease in model size, while standard SVD allows for temporal reduction only. The analysis part is more expensive, but the synthesis has the same cost as existing algorithms [131].

14. Canonical Correlation Analysis

CCA [132], proposed by Hotelling in 1936, is a multivariate statistical technique. It makes use of two views of the same set of objects and projects them onto a lower-dimensional space in which they are maximally correlated. CCA seeks prominently correlated projections between two views of data, and it has been long known to be equivalent to LDA when the data features are used in one view and the class labels are used in the other view [133, 134]. In other words, LDA is a special case of CCA. CCA is equivalent to LDA for binary-class problems [134], and it can be formulated as an LS problem for binary-class problems.

CCA leads to a generalized EVD problem. Thus, we can employ a kernelized version of CCA to compute a flexible contrast function for ICA. Generalized CCA consists of a generalization of CCA to more than two sets of variables [135].

Given two centered random multivariables $\mathbf{x} \in \mathbb{R}^{n_x}$ and $\mathbf{y} \in \mathbb{R}^{n_y}$, the goal of CCA is to find a pair of directions $\mathbf{w}_x$ and $\mathbf{w}_y$ such that the correlation $\rho(\mathbf{x}, \mathbf{y})$ between the two projections $\mathbf{w}_x^T \mathbf{x}$ and $\mathbf{w}_y^T \mathbf{y}$ is maximized.

Suppose that we are given a sample of instances $S = \{(\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_n, \mathbf{y}_n)\}$ of $(\mathbf{x}, \mathbf{y})$. Let $S_x$ denote $(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ and, similarly, $S_y$ denote $(\mathbf{y}_1, \ldots, \mathbf{y}_n)$. We can define a new coordinate for $\mathbf{x}$ by choosing a direction $\mathbf{w}_x$ and projecting $\mathbf{x}$ onto it, $\mathbf{x} \mapsto \mathbf{w}_x^T \mathbf{x}$. If we do the same for $\mathbf{y}$ by choosing a direction $\mathbf{w}_y$, we obtain a sample of the new mapping for $\mathbf{y}$. Let
\[
S_{x,\mathbf{w}_x} = \left( \mathbf{w}_x^T \mathbf{x}_1, \ldots, \mathbf{w}_x^T \mathbf{x}_n \right), \tag{28}
\]
with the corresponding values of the mapping for $\mathbf{y}$ being
\[
S_{y,\mathbf{w}_y} = \left( \mathbf{w}_y^T \mathbf{y}_1, \ldots, \mathbf{w}_y^T \mathbf{y}_n \right). \tag{29}
\]
The first stage of canonical correlation is to choose $\mathbf{w}_x$ and $\mathbf{w}_y$ to maximize the correlation between the two vectors:
\[
\rho = \max_{\mathbf{w}_x, \mathbf{w}_y} \operatorname{corr}\left( S_x \mathbf{w}_x, S_y \mathbf{w}_y \right) = \max_{\mathbf{w}_x, \mathbf{w}_y} \frac{\left\langle S_x \mathbf{w}_x, S_y \mathbf{w}_y \right\rangle}{\left\| S_x \mathbf{w}_x \right\| \left\| S_y \mathbf{w}_y \right\|}. \tag{30}
\]

After manipulation, we have
\[
\rho = \max_{\mathbf{w}_x, \mathbf{w}_y} \frac{\mathbf{w}_x^T E[\mathbf{x}\mathbf{y}^T] \mathbf{w}_y}{\sqrt{\mathbf{w}_x^T E[\mathbf{x}\mathbf{x}^T] \mathbf{w}_x \;\; \mathbf{w}_y^T E[\mathbf{y}\mathbf{y}^T] \mathbf{w}_y}} = \max_{\mathbf{w}_x, \mathbf{w}_y} \frac{\mathbf{w}_x^T \mathbf{C}_{\mathbf{xy}} \mathbf{w}_y}{\sqrt{\mathbf{w}_x^T \mathbf{C}_{\mathbf{xx}} \mathbf{w}_x \;\; \mathbf{w}_y^T \mathbf{C}_{\mathbf{yy}} \mathbf{w}_y}}, \tag{31}
\]
where the covariance matrix of $(\mathbf{x}, \mathbf{y})$ is defined by
\[
\mathbf{C}(\mathbf{x}, \mathbf{y}) = E\left[ \begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix} \begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix}^{T} \right] = \begin{pmatrix} \mathbf{C}_{\mathbf{xx}} & \mathbf{C}_{\mathbf{xy}} \\ \mathbf{C}_{\mathbf{yx}} & \mathbf{C}_{\mathbf{yy}} \end{pmatrix} = \mathbf{C}. \tag{32}
\]
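Numerically, the maximizer of (31) can be obtained from the SVD of the whitened cross-covariance matrix $\mathbf{C}_{\mathbf{xx}}^{-1/2}\mathbf{C}_{\mathbf{xy}}\mathbf{C}_{\mathbf{yy}}^{-1/2}$: its singular values are the canonical correlations, and back-transforming the singular vectors by the inverse square roots gives the canonical directions. A small NumPy sketch of this route follows; the ridge term added to the covariance matrices is an assumed regularizer for numerical stability, not part of the basic formulation.

```python
import numpy as np

def cca(X, Y, reg=1e-6):
    """CCA via SVD of the whitened cross-covariance matrix.
    X: (n, dx), Y: (n, dy) with paired rows; `reg` is an assumed ridge
    term for numerical stability."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = len(X)
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):                          # symmetric inverse square root via EVD
        vals, vecs = np.linalg.eigh(C)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    K = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)   # whitened cross-covariance
    U, rho, Vt = np.linalg.svd(K, full_matrices=False)
    Wx = inv_sqrt(Cxx) @ U                    # canonical directions for x (columns)
    Wy = inv_sqrt(Cyy) @ Vt.T                 # canonical directions for y (columns)
    return Wx, Wy, rho                        # rho: canonical correlations
```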

Under a mild condition which tends to hold for high-dimensional data, CCA in the multilabel case can be formulated as an LS problem [136]. Based on this, efficient algorithms for solving LS problems can be applied to scale CCA to very large data sets. In addition, several CCA extensions, including the sparse CCA formulation based on 𝐿1-norm regularization, are proposed [136]. The LS formulation of CCA and its extensions can be solved efficiently. The LS formulation is extended to orthonormalized partial least squares by establishing the equivalence relationship between CCA and orthonormalized partial least squares [136]. The CCA projection for one set of variables is independent of the regularization on the other set of multidimensional variables.

In [137], a strategy for reducing LDA to CCA is proposed. Within-class coupling CCA (WCCCA) applies CCA to pairs of data samples that are most likely to belong to the same class. Each sample of a class, serving as the first view, is paired with every other sample of that class, which serves as the second view. The equivalence between LDA and such an application of CCA is proved.

Two-dimensional CCA seeks linear correlation based on images directly. Motivated by locality-preserving CCA [138] and spectral clustering, a manifold learning method called local two-dimensional CCA [139] identifies the local correlation by weighting images differently according to their closeness. That is, the correlation is measured locally, which makes local two-dimensional CCA more accurate in finding correlative information. Local two-dimensional CCA is formulated as solving generalized eigenvalue equations tuned by Laplacian matrices.

CCArc [140] is a two-dimensional CCA based on representing an image as the sets of its rows and columns and applying CCA to these sets. CCArc does not require a preliminary downsampling procedure, is not iterative, and is applied along the rows and columns of the input image. The size of the covariance matrices in CCArc is equal to max{𝑀,𝑁}. The small-sample-size problem does not occur in CCArc, because it effectively uses 𝑁 images of size 𝑀×1 and 𝑀 images of size 𝑁×1, which always satisfies the condition max{𝑀,𝑁}<(𝑀+𝑁).

15. A Simulation Example

The concept of subspace is involved in many information processing problems. This requires EVD of the autocorrelation matrix of a data set or SVD of the cross-correlation matrix of two data sets. For example, in the area of array signal processing, the APCA algorithm is used in beamforming [126] and the MCA or PCA method is used in DoA estimation [141]. We have also illustrated weighted SLA, GHA, and APEX in [10]. We now provide an example to illustrate the application of PCA.

Image compression is usually implemented by partitioning an image into many nonoverlapping 8×8 pixel blocks and then compressing them one by one. Based on the statistics of all the blocks, one can use PCA to compress the image [8, 10]. Each block is concatenated into a vector, and all the vectors constitute a training set. PCA is then applied to extract the prominent PCs, thereby compressing the image. Similar results for image compression have been reported in [11, 142] using a three-layer autoassociative network with BP learning. PCA as well as LDA achieves the same results for an original data set and its orthonormally transformed version [143]; thus, both methods can be implemented directly in the DCT domain, and the results are exactly the same as those obtained in the spatial domain.

In this example, we use the APEX algorithm to train the PCA network and then use the trained network to encode other pictures. We use the benchmark Lena picture of 560×560 pixels as the training set, and a kid picture of 640×560, which has similar statistics, is then used for generalization. We first use 8×8 blocks. For each block, only the first PC is significant, and all the other PCs can be ignored in terms of the quality of the restored picture. In this sense, we achieve a compression ratio of 1 : 64. However, the quality of the restored image is very poor. This is because PCA assumes Gaussian statistics for the data set, and a real picture usually does not satisfy this assumption.

To improve the quality of the restored image, we employ nonoverlapping 4×4 blocks and again retain only the first PC of each block. This achieves a compression ratio of 1 : 16. The Lena picture is thus transformed into a training set of 14400 vectors, and the kid picture is transformed into a validation set of 19200 vectors. Because the training set is large, the APEX algorithm converges after only two epochs. The obtained codebook is 𝐰1 = (0.1949, 0.2222, 0.1920, 0.1747, 0.2255, 0.2065, 0.2195, 0.2246, 0.1972, 0.1701, 0.2268, 0.1908, 0.2224, 0.1996, 0.2140, 0.1783)𝑇. The Lena picture and its restored version are shown in Figure 4. The trained network is then used to compress the kid picture. Both the kid picture and its restored version are shown in Figure 5.
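A batch version of this block-PCA compression can be sketched as follows; an eigendecomposition of the block covariance stands in for the online APEX training, a random array stands in for the actual pictures, and the 4×4 blocks with a single retained PC mirror the setting above.

```python
import numpy as np

def block_pca_compress(img, block=4, n_pc=1):
    """Compress an image by keeping n_pc principal components of its
    non-overlapping block x block patches. Batch PCA stands in here for
    the online APEX training used in the text."""
    H, W = img.shape
    H, W = H - H % block, W - W % block            # crop to a multiple of the block size
    patches = (img[:H, :W]
               .reshape(H // block, block, W // block, block)
               .transpose(0, 2, 1, 3)
               .reshape(-1, block * block))        # one row per block
    mean = patches.mean(0)
    C = np.cov((patches - mean).T)                 # covariance of the block vectors
    vals, vecs = np.linalg.eigh(C)
    W_pc = vecs[:, -n_pc:]                         # leading principal component(s)
    codes = (patches - mean) @ W_pc                # compressed representation
    recon = codes @ W_pc.T + mean                  # restored block vectors
    restored = (recon.reshape(H // block, W // block, block, block)
                     .transpose(0, 2, 1, 3)
                     .reshape(H, W))
    return codes, restored

img = np.random.rand(560, 560)                     # stand-in for the training picture
codes, restored = block_pca_compress(img, block=4, n_pc=1)
```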

16. Summary

In this paper, we have discussed various neural network implementations and algorithms for PCA and its various extensions, including PCA, MCA, generalized EVD, constrained PCA, two-dimensional methods, localized methods, complex-domain methods, and SVD. These neural network methods provide an advantage over their conventional counterparts in that they are adaptive algorithms and have low computational as well as low storage complexity. These neural network methods find wide applications in pattern recognition, blind source separation, adaptive signal processing, and information compression. Two methods that are strongly associated with PCA, namely, ICA and LDA, are described here in passing.

16.1. Independent Component Analysis

ICA [144] is a statistical model that extends PCA. ICA has been widely used for BSS, feature extraction, and signal detection. For BSS applications, the ICA model is required to have model identifiability and separability [144]. The first neural network model related to the ICA was developed for online BSS of linearly mixed signals in [145].

ICA extracts statistically independent components from the input data set. It adjusts the estimated demixing matrix so that the outputs are maximally independent, for example, by minimizing the mutual information between them. By applying ICA to estimate the independent components of the raw data, a statistical test can be derived to reduce the input dimension: the dimensions to remove are those that are independent of the output. In contrast, PCA achieves dimensionality reduction by removing those dimensions that have low variance.

Let a 𝐽1-vector 𝐱 denote a linear mixture and a 𝐽2-vector 𝐬, whose components have zero mean and are statistically mutually independent, denote the original source signals. The ICA data model can be written as 𝐱=𝐀𝐬+𝐧,(33) where 𝐀 is an unknown constant full-rank 𝐽1×𝐽2 mixing matrix and 𝐧 denotes the additive noise term, which is often omitted since it is usually impossible to separate noise from the sources. ICA takes one of three forms, namely, square ICA for 𝐽1=𝐽2, overcomplete ICA for 𝐽1<𝐽2, and undercomplete ICA for 𝐽1>𝐽2. While undercomplete ICA is useful for feature extraction, overcomplete ICA may be applied to signal and image processing methods based on multiscale and redundant basis sets.

The goal of ICA is to estimate 𝐬 by 𝐲=𝐖𝑇𝐱(34) such that the components of 𝐲, which is the estimate of 𝐬, are statistically as independent as possible, 𝐖 being a 𝐽1×𝐽2 demixing matrix. The statistical independence property implies that the joint probability density of the components of 𝐬 equals the product of the marginal densities of the individual components. Each component of 𝐬 is a stationary stochastic process, and at most one of the components is allowed to be Gaussian distributed. Estimating 𝐬 requires the higher-order statistics of the original inputs, rather than only the second-order moments or covariances of the samples as used in PCA.

The demixing matrix 𝐖 of ICA is not orthogonal, while in PCA the components of the weights are represented on an orthonormal basis. ICA provides in many cases a more meaningful representation of the data than PCA. ICA can be realized by adding nonlinearity to linear PCA networks such that they are able to improve the independence of their outputs. In [146], an efficient ICA algorithm is derived by minimizing a nonlinear PCA criterion using the RLS approach. The three-layer (𝐽1-𝐽2-𝐽1) linear autoassociative network can also be used as an ICA network, as long as the outputs of the hidden layer are independent.

A well-known two-phase approach to ICA is to preprocess the data by PCA and then to estimate the necessary rotation matrix. A generic approach to ICA consists of preprocessing the data, defining a measure of non-Gaussianity, and optimizing an objective function, known as a contrast function. Measures of non-Gaussianity include kurtosis, differential entropy, negentropy, and mutual information, which can be derived from one another. Popular ICA algorithms include the Infomax, natural-gradient, equivariant adaptive separation via independence (EASI), and FastICA algorithms [10]. FastICA is a well-known fixed-point ICA algorithm [147]. It is derived from the optimization of the kurtosis or negentropy measure using Newton's method. FastICA achieves reliable and at least quadratic convergence. FastICA with symmetric orthogonalization and the tanh nonlinearity has been found to be the best trade-off for ICA [10].
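A compact sketch of FastICA with PCA whitening, the tanh nonlinearity, and symmetric orthogonalization is shown below; it is an illustrative fixed-point loop under these choices, not the reference implementation of [147].

```python
import numpy as np

def fastica(X, n_components, iters=200, seed=0):
    """FastICA sketch: PCA whitening, tanh nonlinearity, and symmetric
    orthogonalization. X holds one mixture sample per row; returns the
    estimated independent components. Illustrative only."""
    X = X - X.mean(0)
    # PCA whitening: project onto the leading eigenvectors and rescale
    vals, vecs = np.linalg.eigh(np.cov(X.T))
    idx = np.argsort(vals)[::-1][:n_components]
    whitener = vecs[:, idx] / np.sqrt(vals[idx])
    Z = X @ whitener                                  # whitened data (n, n_components)

    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_components, n_components))
    for _ in range(iters):
        G = np.tanh(Z @ W.T)                          # g(u) = tanh(u)
        G_prime = 1.0 - G ** 2                        # g'(u)
        # fixed-point update: w <- E[z g(w^T z)] - E[g'(w^T z)] w
        W = G.T @ Z / len(Z) - np.diag(G_prime.mean(0)) @ W
        # symmetric orthogonalization: W <- (W W^T)^(-1/2) W
        Uo, _, Vt = np.linalg.svd(W)
        W = Uo @ Vt
    return Z @ W.T                                    # estimated sources (n, n_components)
```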

The ICA methods can be easily extended to the complex domain by using the Hermitian transpose and complex nonlinear functions. In the context of BSS, higher-order statistics are necessary only for temporally uncorrelated stationary sources. Second-order statistics-based source separation exploits temporally correlated stationary sources and the nonstationarity of the sources [148]. Many natural signals are inherently nonstationary with time-varying variances, and time delays between the source signals can also be incorporated into the basic BSS model.

Blind separation of the original signals in nonlinear mixtures suffers from many difficulties, such as intrinsic indeterminacy, the unknown distributions of the sources and mixing conditions, and the presence of noise. It is impossible to separate the original sources using only the source independence assumption, since independence is preserved under some unknown nonlinear transformations of the sources [149]. Nonlinear ICA can be modeled by a parameterized neural network whose parameters are determined under the criterion of independence of its outputs. The inverse of the nonlinear mixing model can be modeled by the three-layer MLP, the RBFN, the SOM, or a kernel-based nonlinear BSS method [10].

Nonnegativity is a natural condition for many real-world applications, for example, in the analysis of images, text, or air quality. Neural network algorithms can be derived by imposing a nonnegativity constraint on the outputs [29] or on the weights. Nonnegative PCA and nonnegative ICA algorithms are given in [150], where the sources 𝐬𝑖 must be nonnegative. Constrained ICA is a framework that incorporates additional requirements and prior information in the form of constraints into the ICA contrast function [151].

16.2. Linear Discriminant Analysis

LDA creates a linear combination of the given independent features that yields the largest mean differences between the desired classes [2]. Given a set of 𝑁 vectors of dimension 𝐽1, {𝐱𝑖}, covering all the samples of all the 𝐶 classes, the within-class scatter matrix 𝐒𝑤, the between-class scatter matrix 𝐒𝑏, and the mixture scatter matrix 𝐒𝑚 are defined. All the scatter matrices are of size 𝐽1×𝐽1. Minimizing the MSE criterion is equivalent to minimizing the trace of 𝐒𝑤 or maximizing the trace of 𝐒𝑏.

The objective of LDA is to maximize the between-class measure while minimizing the within-class measure after applying a $J_1 \times J_2$ transform matrix $\mathbf{W}$, $J_1 > J_2$, which transforms the $J_1 \times J_1$ scatter matrices into $J_2 \times J_2$ matrices $\tilde{\mathbf{S}}_w$, $\tilde{\mathbf{S}}_b$, $\tilde{\mathbf{S}}_m$:
\[
\tilde{\mathbf{S}}_w = \mathbf{W}^T \mathbf{S}_w \mathbf{W}, \quad \tilde{\mathbf{S}}_b = \mathbf{W}^T \mathbf{S}_b \mathbf{W}, \quad \tilde{\mathbf{S}}_m = \mathbf{W}^T \mathbf{S}_m \mathbf{W}. \tag{35}
\]
Here $\operatorname{tr}(\mathbf{S}_w)$ measures the closeness of the samples within the clusters and $\operatorname{tr}(\mathbf{S}_b)$ measures the separation between the clusters, where $\operatorname{tr}(\cdot)$ denotes the trace operator. An optimal $\mathbf{W}$ should preserve the given cluster structure and simultaneously maximize $\operatorname{tr}(\tilde{\mathbf{S}}_b)$ and minimize $\operatorname{tr}(\tilde{\mathbf{S}}_w)$. Assuming that $\mathbf{S}_w$ is nonsingular, Fisher's determinant ratio criterion is conventionally maximized for finding the projection directions [152, 153]:
\[
E_{\mathrm{LDA}}(\mathbf{W}) = \frac{\det \tilde{\mathbf{S}}_b}{\det \tilde{\mathbf{S}}_w} = \frac{\det\left( \mathbf{W}^T \mathbf{S}_b \mathbf{W} \right)}{\det\left( \mathbf{W}^T \mathbf{S}_w \mathbf{W} \right)}, \tag{36}
\]
where the column vectors $\mathbf{w}_i$, $i = 1, \ldots, J_2$, of $\mathbf{W}$ are the first $J_2$ principal eigenvectors of $\mathbf{S}_w^{-1} \mathbf{S}_b$. Under the assumption that the class distributions are identically distributed Gaussians, LDA is Bayes optimal [3].
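In batch form, maximizing (36) reduces to a generalized symmetric eigenvalue problem for the pair $(\mathbf{S}_b, \mathbf{S}_w)$. A minimal sketch, assuming $\mathbf{S}_w$ is nonsingular and estimating the scatter matrices directly from labeled data:

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(X, labels, n_dims):
    """Fisher LDA sketch: return the J1 x J2 projection W whose columns are
    the leading generalized eigenvectors of (S_b, S_w). Assumes S_w is
    nonsingular; n_dims should be at most C - 1."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    mean_all = X.mean(0)
    d = X.shape[1]
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for c in classes:
        Xc = X[labels == c]
        mean_c = Xc.mean(0)
        S_w += (Xc - mean_c).T @ (Xc - mean_c)        # within-class scatter
        diff = (mean_c - mean_all)[:, None]
        S_b += len(Xc) * (diff @ diff.T)              # between-class scatter
    # generalized EVD S_b w = lambda S_w w; keep the n_dims largest eigenvalues
    vals, vecs = eigh(S_b, S_w)
    return vecs[:, np.argsort(vals)[::-1][:n_dims]]
```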

There are at most $C-1$ nonzero generalized eigenvalues, and thus an upper bound on $J_2$ is $C-1$; at least $J_1 + C$ samples are needed to guarantee that $\mathbf{S}_w$ is nonsingular. This requirement on the number of samples may be severe for problems such as image processing; in such cases, $\mathbf{S}_w$ will be singular and regularization may be necessary. By introducing kernels into $\mathbf{W}$, nonlinear discriminant analysis is obtained [3, 154]. A multiple of the identity or of the kernel matrix can be added to $\mathbf{S}_w$, or to its kernelized reformulation $\tilde{\mathbf{S}}_w$, to penalize the squared norm of the corresponding weight vector [154]. The high-dimensional features can also be first compressed by PCA into an intermediate-dimensional space, which is then projected by LDA onto the low-dimensional space; the overall performance of this two-stage approach is sensitive to the reduced dimension of the first stage. A generalization of LDA using generalized SVD [153] can also be used to resolve the singularity of $\mathbf{S}_w$. For image feature extraction, two-dimensional LDA algorithms [155, 156] provide an efficient approach and can overcome the singularity problem.

In [152], a nonlinear discriminant analysis network with the MLP as the architecture and Fisher’s determinant ratio as the criterion function is obtained by combining the universal approximation properties of the MLP with the target-free nature of LDA. A layered lateral network-based LDA network and an MLP-based nonlinear discriminant analysis network are also proposed in [111]. Based on a single-layer linear feedforward network, LDA algorithms are also given in [112, 113].