Abstract
Many information processing problems can be transformed into some form of eigenvalue or singular value problems. Eigenvalue decomposition (EVD) and singular value decomposition (SVD) are usually used for solving these problems. In this paper, we give an introduction to various neural network implementations and algorithms for principal component analysis (PCA) and its various extensions. PCA is a statistical method that is directly related to EVD and SVD. Minor component analysis (MCA) is a variant of PCA, which is useful for solving total least squares (TLSs) problems. The algorithms are typical unsupervised learning methods. Some other neural network models for feature extraction, such as localized methods, complex-domain methods, generalized EVD, and SVD, are also described. Topics associated with PCA, such as independent component analysis (ICA) and linear discriminant analysis (LDA), are mentioned in passing in the conclusion. These methods are useful in adaptive signal processing, blind signal separation (BSS), pattern recognition, and information compression.
1. Introduction
In information processing such as pattern recognition, data compression and coding, image processing, high-resolution spectrum analysis, and adaptive beamforming, feature extraction or feature selection is necessary to deal with the large storage of raw data. Feature extraction is a dimensionality-reduction technique, mapping high-dimensional patterns onto a lower-dimensional space by extracting the most prominent features using orthogonal transforms. The extracted features do not have any physical meaning. In contrast, feature selection decreases the size of the feature set or reduces the dimension of the features by discarding the raw information according to a criterion.
Orthogonal decomposition is a well-known technique to eliminate ill-conditioning. The Gram-Schmidt orthonormalization (GSO) is suitable for feature selection. This is due to the fact that the physically meaningless features in Gram-Schmidt space can be linked back to the same number of variables of the measurement space, thus resulting in no dimensionality reduction. The GSO procedure starts with QR decomposition of the transpose of the full feature matrix, , where . QR decomposition can be performed by using the Householder transform or the Givens rotation [1], which is suitable for hardware implementation. The GSO transform can be used for feature subset selection; it inherits the compactness of the orthogonal representation and at the same time provides features that retain their original meaning.
An orthogonal transform can decompose the correlations among the candidate features so that the significance of the individual features can be evaluated independently. Principal component analysis (PCA) is a well-known orthogonal transform that is used for dimensionality reduction. Another popular technique for feature extraction is linear discriminant analysis (LDA), also known as Fisher’s discriminant analysis [2, 3]. Taking all the data into account, PCA computes vectors that have the largest variance associated with them. The generated PCA features do not have clear physical meanings. In contrast, LDA searches for those vectors in the underlying space that best discriminate among the classes rather than those that best describe the data.
For a -dimensional data set of size , PCA [4] generates a -dimensional feature set of the same size, , by using the linear transformation . The weight matrix can be solved under different criteria such as the output variance maximization or MSE minimization. In comparison with the GSO transform, PCA generates each of its features based on the covariance matrix of all the vectors , . Dimensionality reduction is achieved by dropping the variables with insignificant variances. PCA is often used to select inputs, but it is not always useful, since the variance of a signal is not always related to the importance of the signal, for non-Gaussian signals. An improvement on PCA is provided by nonlinear generalizations of PCA, which extend the ability of PCA to incorporate nonlinear relationships in the data. Two-dimensional PCA [5] is designed for image feature extraction.
In this paper, we give a state-of-the-art introduction to various neural network implementations and algorithms for PCA and its extensions. This paper is organized as follows. In Section 2, we introduce the Hebbian learning rule and Oja’s learning rule. Section 3 defines the PCA problem. Various PCA networks and algorithms are treated in Sections 4–7. PCA algorithms based on the Hebbian rule are expanded in Section 4. In Section 5, least means squared error-based PCA methods are dealt with. Other optimization-based PCA methods are described in Section 6. PCA based on the anti-Hebbian rule is treated in Section 7. Nonlinear PCA is addressed in Section 8. Section 9 is dedicated to minor component analysis (MCA). In Section 10, we describe various localized PCA strategies. Section 11 extends all the methods to the complex-valued domain. Some other generalizations of PCA such as constrained PCA, generalized EVD, and two-dimensional PCA are described in Section 12. In Section 13, the cross-correlational PCA asymmetric network and SVD are described. Canonical correlation analysis is described in Section 14. A simulation example of PCA is given in Section 15. A brief summary is given in Section 16, and independent component analysis (ICA) and linear discriminant analysis (LDA) are also mentioned in passing in this section.
2. Hebbian Learning Rule and Oja’s Learning Rule
The stochastic approximation theory [6], introduced by Robbins and Monro in 1951, is an important tool for analyzing stochastic discrete-time systems including the classical gradient-descent method. Given a stochastic discrete-time system of the form where is the state vector, a finite nonzero vector that uses functions as entries, and an unbiased noisy term. Assuming that is a sequence of positive numbers satisfying the Robbins-Monro conditions [6] then the stochastic system (1) can be transformed into a deterministic differential equation If (3) converges to a fixed point , then (1) also converges to as with probability one. The Robbins-Monro conditions [6] require as . Typically, one can select , being a constant, or , [7, 8].
The Hebbian learning rule was introduced in [9]. For a single neuron, the Hebbian rule is written as where the learning rate , is the weight vector from the input to the neuron, is an input vector presented at time , and the output of the neuron is defined by
For a stochastic input vector , applying an analysis on (4) using the stochastic approximation theory, we get the only equilibrium state by maximizing the criterion , where is the expectation operator [10, 11]. The solution is unstable, which drives to infinite magnitude, with a direction parallel to that of the eigenvector of corresponding to the largest eigenvalue [11]. To prevent the divergence of the Hebbian rule, can be normalized after each iteration [12, 13], and this leads to the normalized Hebbian rule. Other methods, such as Oja’s rule [14], Yuille’s rule [15], Linsker’s rule [16], and Hassoun’s rule [11], add a weight-decay term to the Hebbian rule to stabilize the algorithm.
Oja’s rule introduces a weight decay term into the Hebbian rule [14] to prevent instability Oja’s rule converges to a state that maximizes subject to . The solution is the principal eigenvector of [14]. For small , Oja’s rule is proved to be equivalent to the normalized Hebbian rule [14]. Based on stochastic approximation theory, the stable solutions exist only at , if , where is the eigenvector corresponding to , and and are the two largest eigenvalues of [10, 11]. Thus, Oja’s rule always converges to the principal component of .
The Robbins-Monro conditions are not practical for implementation, especially for learning nonstationary data. Zufiria [17] has proposed to convert the stochastic discrete-time algorithms into their deterministic discrete-time formulations that characterize their average evolution from a conditional expectation perspective. Analysis based on this method guarantees the convergence of Oja’s rule by selecting some constant learning rate. Oja’s rule almost always converges exponentially to the unit eigenvector associated with the largest eigenvalue of , starting from points in an invariant set [18]. A constant learning rate for fast convergence is suggested as [18].
3. Principal Component Analysis
PCA is based on the spectral analysis of the second-order moment matrix called correlation matrix that statistically characterizes a random vector. In the zero-mean case, this matrix becomes the covariance matrix. In the area of image coding, PCA is known as Karhunen-Loeve transform (KLT) [4], which exploits correlation between neighboring pixels or groups of pixels for data compression.
PCA allows the removal of the second-order correlation among given random processes. By calculating the eigenvectors of the covariance matrix of the input vector, PCA linearly transforms a high-dimensional input vector into a low-dimensional one whose components are uncorrelated. PCA is directly related to singular value decomposition (SVD), and the most common way to perform PCA is via SVD of the data matrix. However, the capability of SVD is limited for vey large data set.
PCA is often derived by optimizing some information criterion, such as the maximization of the variance of the projected data or the minimization of the reconstruction error. The objective of PCA is to extract orthonormal directions , , , in the input space that account for as much of the data’s variance as possible. Subsequently, an input vector may be transformed into an -dimensional space without losing essential intrinsic information. The vector can be represented by being projected onto the -dimensional subspace spanned by using the inner products , hence achieving dimensionality reduction.
PCA finds those unit directions , along which the projections of the input vectors, known as the principal components (PCs), , have the largest variance where . Based on an analysis using the stochastic approximation theory [10, 11], when , where is a scalar, takes its maximum value. When , becomes a unit vector.
By repeating maximization of but limiting to be orthogonal to , the maximum of is equal to at . Following this deflation procedure, all the principal directions can be derived [11]. The projections , , are the PCs of .
A linear least squares (LS) estimate can be constructed for the original input as . This is a data reconstruction process. The reconstruction error is the difference between the original and reconstructed data Naturally, is orthogonal to . Each principal component is a Gaussian with zero mean and variance .
4. Hebbian Rule-Based Principal Component Analysis
Neural PCA originates from the seminal work by Oja [14]. For a single neuron, the output is given by where the weights to the neuron .
The single-neuron model was extended to a - feedforward network model to extract the first PCs [7]. The architecture of the PCA network is shown in Figure 1. The output of the network is given by where , , , and , being the weight from the th input to the th neuron.
4.1. Subspace Learning Algorithms
By using Oja’s rule (6), will converge to a unit eigenvector of , and the variance of is maximized. For zero-mean input data, this extracts the first PC [14, 19]. In order to keep the algorithm convergent, is required [7]. If , will not converge to even if it is initially close to the target [20]. One can select at the beginning and gradually decrease [7].
The symmetrical subspace learning algorithm (SLA) [7] is a learning algorithm for the PCA network. The SLA is based on Oja’s rule and is given by [7] After the algorithm converges, is roughly orthonormal and the columns of , namely, , , converge to some linear combination of the first principal eigenvectors of [7, 21], which is a rotated basis of the dominant eigenvector subspace. This is called principal subspace analysis (PSA). The value of is dependent on the initial conditions and training samples. The corresponding eigenvalues approximate and can be adaptively estimated by
Weighted SLA introduces asymmetry into the SLA [22, 23]: for , where the coefficients satisfy . In the weighted SLA, almost surely converges to the eigenvectors of . The weighted SLA can perform PCA; however, norms of the weight vectors are not equal to unity.
SLA and weighted SLA are nonlocal algorithms, and they rely on the calculation of the errors and the backward propagation of the values between the layers. A PCA algorithm is obtained by adding a term to SLA [7] so as to rotate the basis vectors in the principal subspace toward the principal eigenvectors [24]. The adaptive learning algorithm (ALA) [20] is also a PCA algorithm that is based on SLA. In ALA, each neuron adaptively updates its learning rate by , where is the estimated eigenvalue and can be estimated using (12), is set to be smaller than and decreases to zero as . All will quickly converge, at nearly the same rate, to in the order of descending eigenvalues. The performance is better than that of the generalized Hebbian algorithm (GHA) [8]. SLA has been extended in [25] so as to extract a noise robust projection.
4.2. Generalized Hebbian Algorithm
By combining Oja’s rule and the GSO procedure, Sanger proposed GHA for extracting the first PCs [8]. GHA can extract the first eigenvectors in the order of decreasing eigenvalues.
The GHA is given by [8] for . GHA becomes a local algorithm by rewriting the summation term in (15) in a recursive form where . Usually is selected as the same for all neurons. In GHA, the th neuron converges to the th PC, and all neurons tend to converge together. and , as .
Both SLA [7, 22] and GHA [8] employ implicit or explicit GSO to decorrelate the connection weights. The weighted SLA [22] performs well for extracting less dominant components.
4.3. Other Hebbian Rule-Based Algorithms
In addition to the popular SLA, weighted SLA and GHA algorithm, there are some other Hebbian rule-based PCA algorithms such as local LEAP (learning machine for adaptive feature extraction via principal component analysis) [26], the nonlocal dot-product-decorrelation (DPD) rule [27], and the local invariant-norm PCA [28].
The LEAP algorithm [26] is a local PCA algorithm for extracting all the PCs and their corresponding eigenvectors. Unlike SLA [7] and GHA [8], whose stability analysis is based on the stochastic approximation theory [6], the stability analysis of LEAP is based on Lyapunov’s first theorem, and as such can be selected as a small positive constant [26]. Due to a constant learning rate, LEAP is capable of tracking nonstationary processes. LEAP can satisfactorily extract PCs even for ill-conditioned autocorrelation matrices [26].
The DPD algorithm is a nonlocal PCA algorithm [27]. It moves , , towards the principal eigenvectors , ordered arbitrarily. The algorithm induces the norms of the weight vectors towards the corresponding eigenvalues, that is, , as . The algorithm breaks the symmetry in its learning process by the difference in the norms of the weight vectors while keeping the symmetry in its structure. The algorithm is as fast as the GHA [8], weighted SLA [22], and least mean squared error reconstruction (LMSER) [29] algorithms.
5. Least Mean Squared Error-Based Principal Component Analysis
Existing PCA algorithms including the Hebbian rule-based algorithms can be derived by optimizing an objective function using the gradient-descent method. The least mean squared error- (LMSE-) based methods are derived from the modified MSE function where is a forgetting factor used for nonstationary observation sequences, and is the current instant. Many adaptive PCA algorithms actually optimize (17) by using the gradient-descent method [29, 30] and the RLS method [30–34].
The gradient-descent or Hebbian rule-based algorithms are highly sensitive to . RLS-based algorithms such as adaptive principal components extraction (APEX) [35], Kalman-type RLS [31], projection approximation subspace tracking (PAST) [30], PAST with deflation (PASTd) [30], and the robust RLS algorithm (RRLSA) [33] can overcome the drawback. All RLS-based PCA algorithms exhibit fast convergence and high tracking accuracy and are suitable for slowly varying nonstationary vector stochastic processes. However, RLS method may cause instability in certain cases. All these algorithms correspond to a three-layer -- linear autoassociative network model, and they can extract all the PCs in the descending order of the eigenvalues, where a GSO-like orthonormalization procedure is used.
In [36], a regularization term is added to (17), where is a stack vector of and is a diagonal matrix with dimension . As is sufficiently large, this term is negligible. This term ensures that the entries of do not become too large. The Gauss-Seidel recursive PCA and Jacobi recursive PCA algorithms are derived in [36].
The LMSER algorithm is derived on the MSE criterion using the gradient-descent method [29]. LMSER reduces to Oja’s SLA algorithm when is orthonormal, namely, . Oja’s algorithm can thus be treated as an approximate stochastic gradient rule to minimize the MSE. LMSER [29] has been compared with the weighted SLA [22] and GHA [8] in [37]. LMSER [29] uses nearly twice as much computation as weighted SLA [22] and GHA [8], for each update of the weight. However, it leads to a smaller asymptotic MSE and faster convergence for the minor eigenvectors [37].
PASTd [30] is a well-known subspace tracking algorithm updating the signal eigenvectors and eigenvalues. PASTd is based on PAST. Both PAST and PASTd are derived for complex-valued signals. Both PAST and PASTd have linear computational complexity, that is, operations every update, as in the cases of SLA [14], GHA [8], LMSER [29], and novel information criterion (NIC) [32]. PAST computes an arbitrary basis of the signal subspace, while PASTd is able to update the signal eigenvectors and eigenvalues. Both the algorithms produce nearly orthonormal, but not exactly orthonormal, subspace basis or eigenvector estimates. If perfectly orthonormal eigenvector estimates are required, an orthonormalization procedure is necessary.
Kalman-type RLS [31] combines the basic RLS algorithm with the GSO procedure. Kalman-type RLS and PASTd are exactly identical if the inverse of the covariance of the output of the th neuron in Kalman-type RLSA takes a special value. In the one-unit case, both PAST and PASTd reduce to Oja’s learning rule [14]. Both PAST and PASTd provide much more robust estimates than eigenvalue decomposition (EVD) and converge much faster than SLA [14]. PASTd has been extended for tracking both the rank and the subspace by using some information theoretic criteria [38].
RRLSA [33] is more robust than PASTd [30]. RRLSA can be implemented in a sequential or parallel form. RRLSA has the flexibility as Kalman-type RLS [31], PASTd [30], APEX [35] in that increasing the number of neurons does not affect the previously extracted principal components. RRLSA naturally selects the inverse of the output energy to be the adaptive learning rate for the Hebbian rule. The Hebbian and Oja rules are closely related to the RRLSA algorithm by suitable selection of the learning rates [33]. RRLSA [33] is also robust to the error accumulation from the previous components, which exists in the sequential PCA algorithms like Kalman-type RLS [31] and PASTd [30]. RRLSA converges fast, even if the eigenvalues extend over several orders of magnitude. According to the empirical results [33], RRLSA provides the best performance in terms of the convergence speed as well as the steady-state error, whereas Kalman-type RLS and PASTd have similar performance, which is inferior to that of RRLSA, and ALA [20] exhibits the poorest performance.
6. Other Optimization-Based Principal Component Analysis
PCA can be derived by any optimization method based on a proper objective function. This leads to many other algorithms, including gradient-descent based algorithms [15, 16, 39, 40], the conjugate gradient (CG) method [41], and the quasi-Newton method [42, 43]. The gradient-descent method usually converges to a local minimum. Some adaptive algorithms derived from the gradient descent, conjugate direction, and Newton-Raphson methods, whose simulation results are better than that of the gradient-descent method [29], have also been proposed in [44]. Second-order algorithms such as the CG [41] and quasi-Newton methods [42] typically converge much faster than first-order methods but have a computational complexity of per iteration.
The infomax principle [16] was first proposed by Linsker to describe a neural network algorithm. The principal subspace is derived by maximizing the mutual information criterion.
The NIC algorithm [32] is obtained by applying the gradient-descent method to maximize the NIC, a cost function that is very similar to the mutual information criterion [16, 45] but integrates a soft constraint on the weight orthogonalization. The NIC has a steep landscape along the trajectory from a small weight matrix to the optimum one. has a single global maximum, and all the other stationary points are unstable saddle points. At the global maximum, yields an arbitrary orthonormal basis of the principal subspace. The NIC algorithm has a computational complexity of for each iteration.
The NIC algorithm is a PSA method. It can extract the principal eigenvectors when the deflation technique is incorporated. The NIC algorithm converges much faster than SLA [22] and LMSER [29] and is able to globally converge to the PSA solution from almost any weight initialization. Reorthonormalization can be applied so as to perform true PCA [30, 32]. An RLS version of the NIC algorithm is given in [32]. The PAST algorithm [30] is a special case of the NIC algorithm when takes unity. The weighted information criterion (WINC) [34] is obtained by adding to the NIC a weight to break the symmetry in the NIC. Two WINC algorithms are derived by using gradient ascent and RLS, respectively. The gradient ascent-based WINC algorithm can be viewed be a weighted SLA [23] with an adaptive step size, leading to a much faster convergence speed. The RLS-based WINC algorithm not only provides fast convergence and high accuracy but also has low computational complexity.
Most popular PCA or MCA algorithms do not consider eigenvalue estimates in the update equations of the weights, and they suffer from the stability-speed problem. The convergence speed of a system depends on the eigenvalues of its Jacobian. In PCA algorithms, the eigenmotion depends on the principal eigenvalue of the covariance matrix, while in MCA algorithms on all the eigenvalues [46]. Coupled learning rules can be derived by applying the Newton method to a common information criterion. In coupled PCA/MCA algorithms [46], both the eigenvalues and the eigenvectors are simultaneously adapted. The Newton method yields averaged systems with identical speed of convergence in all eigendirections.
In order to extract multiple PCs, one has to apply an orthonormalization procedure, like GSO, or its first-order approximation as used in SLA [7, 22], or deflation as in GHA [8]. In the coupled learning rules, multiple PCs are simultaneously estimated by a coupled system of equations. In the coupled learning rules a first-order approximation of GSO is superior to the standard deflation procedure in terms of the orthonormality error and the quality of the eigenvectors and eigenvalues generated [47].
7. Anti-Hebbian Rule-Based Principal Component Analysis
The anti-Hebbian learning rule updates a synaptic weight by the same amount as that in the Hebbian rule [9] but in the opposite direction. The anti-Hebbian rule can be used to remove correlations between units receiving correlated inputs [13, 48, 49]. The anti-Hebbian rule is inherently stable [13, 49].
Anti-Hebbian rule-based PCA algorithms can be derived by using a - feedforward architecture with lateral connections among the output units [13, 48, 49]. The lateral connections can be in a symmetrical or hierarchical topology. The hierarchical lateral connection topology is illustrated in Figure 2, based on which Rubner-Tavan PCA [13, 49] and APEX [50] algorithms are proposed. In [48], the local PCA algorithm is based on a symmetrical lateral connection topology. In addition to the feedforward weights , the lateral weight matrix is a matrix, where contains all the lateral weights connected to neuron and denotes the lateral weight from unit to unit .
The Rubner-Tavan PCA algorithm is based on the PCA network with hierarchical lateral connection topology [13, 49]. The algorithm extracts the first PCs in the decreasing order of the eigenvalues.
The weights are trained by Oja’s rule, while the lateral weights are updated by the anti-Hebbian rule. This is a nonlocal algorithm. During the training process, the outputs of the neurons are gradually uncorrelated and the lateral weights approach zero. The network should be trained until the lateral weights are below a specified level. The PCA algorithm proposed in [48] has the same form as Rubner-Tavan PCA, but is a full matrix.
The APEX algorithm is used to adaptively extract the PCs [50]. The algorithm is recursive and adaptive, namely, given PCs, it can produce the th PC iteratively. The hierarchical structure of lateral connections serves the purpose of weight orthogonalization and also allows the network to grow or shrink without retraining the old units. APEX is proved to have the property of exponential convergence [50].
Assuming that the correlation matrix has distinct eigenvalues arranged in the decreasing order as with the corresponding eigenvectors , the algorithm is given as [35, 50] where is the output vector, , and is the weight matrix of the first neurons. The iteration is given as [35, 50] Equation (19) is the Hebbian part, and (20) the anti-Hebbian part. tends to be orthogonal to all the previous components due to the anti-Hebbian rule, also called orthogonalization rule. APEX can also be derived from the RLS method using the MSE criterion. Based on the RLS method, an optimum learning rate in terms of convergence speed is given by , where is a forgetting factor [35].
A desirable number of neurons can be decided during the learning process. When the environment is changing over time, a new PC can be added to compensate for the change without affecting the previously computed PCs, and the network structure can be expanded if necessary. converges to the eigenvector of the correlation matrix corresponding to the th largest eigenvalue, and converges to zero.
For growing each additional PC, APEX requires per iteration, while GHA requires per iteration.
A class of learning algorithms, called -APEX, is presented based on a criterion optimization [51, 52]. can be selected as an arbitrary function that guarantees the stability of the network. Some members in the class have better numerical performance and require less computational effort compared to that of both GHA and APEX.
Most existing linear complexity methods including GHA [8], SLA [7], and PCA with the lateral connections [13, 35, 48–50] require a computational complexity of per iteration.
8. Nonlinear Principal Component Analysis
PCA is based on the Gaussian assumption for data distribution, and the optimality of PCA results from taking into account only the second-order statistics. For non-Gaussian data distributions, PCA is not able to capture complex nonlinear correlations, and nonlinear processing of the data is usually more efficient. Nonlinearities introduce higher-order statistics into the computation in an implicit way. Higher-order statistics, defined by cumulants, are needed for a good characterization of non-Gaussian data. PCA can be generalized to distributions of the exponential family [53].
When the feature space is nonlinearly related to the input space, we need to use nonlinear PCA. The outputs of nonlinear PCA networks are usually more independent than their respective linear cases. For non-Gaussian input data, a nonlinear PCA permits the extraction of higher-order components and provides a sufficient representation.
Kernel PCA [3, 54] is a special, linear algebra-based nonlinear PCA, which introduces kernel functions into PCA. The kernel PCA first maps the original input data into a high-dimensional feature space using the kernel method and then calculates PCA in the high-dimensional feature space. It is much more complicated and may sometimes be caught more easily in local minima. PCA needs to deal with an eigenvalue problem of a matrix, while kernel PCA needs to solve an eigenvalue problem of an matrix. Sparse approximation methods can be applied to reduce the computational cost [3].
In order to increase the robustness of PCA against outliers, a robust version of the covariance matrix based on the -estimator can be used. Several popular PCA algorithms have been generalized into robust versions by applying statistical physics approach [55], where the defined objective function can be regarded as a soft generalization of an -estimator [56]. Robust PCA can be defined so that the optimization criterion grows less than quadratically with the same constraint conditions as those for PCA, which are based on the quadratic criterion [57]. This usually leads to mildly nonlinear algorithms, in which the nonlinearities appear at selected places only and at least one neuron produces the linear response . When all the neurons generate nonlinear responses , the algorithm is referred to as nonlinear PCA. The robust or nonlinear PCA algorithms are derived by using the gradient-descent method [57]. They can be treated as generalization of SLA [7, 22] and the GHA [8]. Robust and nonlinear PCA algorithms have better stability properties than the corresponding PCA algorithms if the (odd) nonlinearity grows less than linearly, namely, [57]. On the contrary, nonlinearities growing faster than linearly cause stability problems easily and are not recommended. These extensions are also introduced in [29, 58, 59].
The multilayer perceptron (MLP) can be used to perform nonlinear dimensionality reduction and hence nonlinear PCA. Both the input and output layers of the MLP have units, and one of its hidden units, known as the bottleneck or representation layer, have units, . The network is trained to reproduce its input vectors themselves. This kind of networks is called the autoassociative network. After the network is trained, it performs a projection onto the -dimensional subspace spanned by the first PCs of the data. The vectors of weights leading to the hidden units form a basis set which spans the principal subspace, and data compression therefore occurs in the bottleneck layer. Many applications of the autoassociative MLP for PCA are available in the literature [60–63].
Kramer’s nonlinear PCA network [62] is a five-layer autoassociative MLP. It has input and output nodes. The third layer has nodes. Nonlinear activation functions such as the sigmoidal functions are used in the second and fourth layers, while the nodes in the bottleneck and output layers usually have linear activation functions, although they can be nonlinear. The network is trained by the backpropagation (BP) algorithm [10, 64]. Kramer’s nonlinear PCA fits a lower-dimensional surface through the training data. Usually, the data compression achieved in the bottleneck layer in such networks is somewhat better than that provided by the PCA solution [65]. However, the BP algorithm is prone to local minima and often requires excessive time for convergence.
A hierarchical nonlinear PCA network composed of a number of independent subnetworks can extract ordered nonlinear PCs [66]. Each subnetwork extracts one PC and has at least five layers. The subnetworks can be selected as Kramer’s nonlinear PCA network and are hierarchically arranged and trained. This network constructs the extraction functions in the order of the reconstruction efficiency as to the objective data. The number of PCs to be extracted is not required to be known in advance.
A hybrid hetero/autoassociative network [67] is constructed with a set of autoassociative outputs and a set of heteroassociative outputs. Both sets of output nodes are fully connected to the same bottleneck layer. The improvement over an autoassociative network or the PCA can be attributed to the reorganization done by the network in the representation layer space. The self-organizing map (SOM) [68] is a competitive learning-based neural network. It is capable of performing dimensionality reduction on the input. The SOM is inherently nonlinear and is viewed as a nonlinear PCA [69]. The adaptive subspace SOM (ASSOM) [70, 71] can be treated as a hybrid of VQ and PCA.
9. Minor Component Analysis
MCA, as a variant of PCA, is to find the smallest eigenvalues and their corresponding eigenvectors of the autocorrelation matrix of the signals. MCA is closely associated with the curve and surface fitting under the total least squares (TLSs) criterion [72]. MCA provides an alternative solution to the TLS problem [73]. The TLS technique achieves a better global optimal objective than the LS technique [1]. Both the TLS and LS problems can be solved by SVD. However, the TLS technique is computationally much more expensive than the least squares (LSs) technique [74]. MCA is useful in many fields including spectrum estimation, optimization, TLS parameter estimation in adaptive signal processing, and eigen-based bearing estimation.
The anti-Hebbian learning rule and its normalized version can be used for MCA [75], but both may lead to infinite magnitudes of weights [76]. To avoid this, one can renormalize the weight vector at each iteration. The constrained anti-Hebbian learning algorithm [73, 77] has a simple structure and requires a low computational complexity per update. However, the convergence of the magnitudes of the weights cannot be guaranteed either unless the initial weights take special values. The total least mean squares (TLMS) algorithm [74] is a random adaptive algorithm for extracting the MC, which has an equilibrium point under persistent excitation conditions. The TLMS requires about multiplications per iteration, which is twice the complexity of the LMS [78]. An adaptive step-size learning algorithm [79] has been derived for extracting the MC by introducing information criterion. The algorithm globally converges asymptotically to the MC and its corresponding eigenvector. The algorithm outperforms the TLMS [74] in terms of both the convergence speed and the estimation accuracy.
Minor components (MCs) can be extracted in ways similar to that for PCs. A simple idea is to reverse the sign of the PCA algorithms. This is because in many algorithms PCs correspond to the maximum of a cost function, while MCs correspond to the minimum of the same cost function. However, this idea does not work in general and has been discussed in [22]. Oja’s minor subspace analysis (MSA) algorithm can be formulated by reversing the sign of the learning rate of the SLA. This algorithm requires the assumption that the smallest eigenvalue of the autocorrelation matrix is less than unity. However, Oja’s MSA algorithm is known to diverge [22, 80, 81]. The bigradient PSA algorithm [82] is a modification to SLA [7] and is obtained by introducing an additional bigradient term embodying the orthonormal constraints of the weights, and it can be used for MSA by reversing the sign of .
A general algorithm that can extract, in parallel, principal and minor eigenvectors of arbitrary dimensions is derived based on the natural-gradient method [83]. The difference between PCA and MCA lies in the sign of the learning rate. The MCA algorithm proposed in [83] suffers from a marginal instability, and thus it requires intermittent normalization such that [80]. A self-stabilizing MCA algorithm has been proposed in [80], such that none of deviates significantly from unity. It diverges for PCA when is changed to .
The orthogonal Oja (OOja) algorithm consists of Oja’s MSA [7] plus an orthogonalization of at each iteration [84], . In this case, the above algorithms given in [7, 80, 83] are equivalent. The OOja is numerically very stable. By reversing the sign of , we extract PCs. The normalized Oja (NOja) is derived by optimizing the MSE criterion subject to an approximation to the orthonormal constraint [81]. This leads to the optimal learning rate. The normalized orthogonal Oja (NOOja) is an orthogonal version of the NOja such that the orthonormal constraint is perfectly satisfied [81]. Both algorithms offer, as compared to Oja’s SLA, a faster convergence, orthogonality, and a better numerical stability with a slight increase in the computational complexity. By switching the sign of in given learning algorithms, both the NOja and the NOOja can be used for the estimation of minor and principal subspaces of a vector sequence.
The above algorithms including Oja’s MSA [7], the natural-gradient-based method [83], self-stabilizing MCA [80], OOja, NOja, and NOOja have a complexity of [80, 84]. OOja, NOjia, and NOOjia require less computation load than the natural-gradient-based method [83], self-stabilizing MCA [80, 81, 84].
By using the Rayleigh quotient as an energy function, invariant-norm MCA [85] is analytically proved to converge to the first MC of the input signals. However, invariant-norm MCA [85] leads to divergence in finite time [76], and this drawback can be eliminated by renormalizing the weight vector at each iteration. In [86], an MCA algorithm for extracting multiple MCs is described by using the idea of sequential addition; a conversion method between MCA and PCA is also discussed. Based on a generalized differential equation for the generalized eigenvalue problem, a class of algorithms can be obtained for extracting the first PC or MC by selecting different parameters and functions [87]. Many existing PCA and MCA algorithms are special cases of this class. All the algorithms of this class have the same order of convergence speed and are robust to implementation error. A rapidly convergent quasi-Newton method has been applied to extract multiple MCs in [88]. The proposed algorithm has a complexity of but with a quadratic convergence. It makes use of the implicit orthogonalization procedure that is built into it through an inflation technique.
10. Localized Principal Component Analysis
The nonlinear PCA problem can be solved by partitioning the data space into a number of disjunctive regions and then estimating the principal subspace within each partition by linear PCA. This is the so-called localized PCA. The distribution is collectively modeled by a collection or a mixture of linear PCA models, each characterizing a partition. Most natural data sets have large eigenvalues in only a few eigendirections, while the variances in other eigendirections are so small as to be considered as noise. The localized PCA method provides an efficient means to decompose high-dimensional data compression problems into low-dimensional data compression problems. The localized PCA method is commonly used in image compression [8]. An image is often first transformation coded by PCA, and then the coefficients are quantized.
VQ-PCA [65] is a locally linear model that uses VQ to define the Voronoi regions for localized PCA. The algorithm builds a piecewise linear model of the data. It performs better than the global models implemented by PCA model and Kramer’s nonlinear PCA and is significantly faster than Kramer’s nonlinear PCA [65]. Adaptive combination of PCA and VQ is given in [89], where an autoassociative network is used to perform PCA and simple competitive learning is used to perform VQ. The error between the input and output of the autoassociative network is fed to the VQ network. The network produces better results than by using the two algorithms successively. An online localized PCA algorithm [90] is developed by extending the neural gas method [91]. Instead of the Euclidean distance measure, a combination of a normalized Mahalanobis distance and the squared reconstruction error guides the competition between the units. Weighting between the two measures is determined from the residual variance in the minor subspace of each submodel. The unit centers are updated as in neural gas, while subspace learning is based on the RRLSA algorithm [33].
Similar to localized PCA, localized ICA is used to characterize nonlinear ICA. Clustering is first used for an overall coarse nonlinear representation of the underlying data and linear ICA is then applied in each cluster so as to describe local features of the data [92]. This leads to a better representation of the data than in linear ICA. ASSOM [93] is another localized PCA for unsupervised extraction of invariant local features from the input data. ASSOM associates a subspace instead of a single weight vector to each node of the SOM. The subspaces in ASSOM can be formed by applying ICA [94].
11. Extending to Complex Domain
Complex PCA is a generalization of PCA in complex-valued data sets [95]. Complex PCA has been widely applied to complex-valued data and two-dimensional vector fields. Complex PCA employs the same neural network architecture as that of PCA, but with complex weights. The objective functions for PCA can also be adapted to complex PCA by changing the transpose into the Hermitian transpose. For example, for complex PCA, complex PCs can be extracted by minimizing the MSE function where is the th input complex vector.
In [96], a complex-valued neural network model is developed for nonlinear complex PCA. Nonlinear complex PCA has the ability to extract nonlinear features missed by PCA. It uses the architecture of Kramer’s nonlinear PCA network [62], but with complex weights and biases. For a similar number of model parameters, the nonlinear complex PCA model captures more variance of a data set than the alternative real approach, where each complex variable is replaced by two real variables and is applied to Kramer’s nonlinear PCA. The complex nonlinear transfer function is selected as with [97]. To limit the modulus of the net input of a neuron to be less than , one can initialize the algorithm with weights and biases of small magnitudes and use a weight penalty in the objective function. A complex-valued BP or quasi-Newton algorithm can be used for training.
There are many other complex PCA algorithms. Both PAST and PASTd are, respectively, the PSA and PCA algorithms derived for complex-valued signals [30]. A heuristic complex extension of GHA [8] and APEX [35] is, respectively, given in [98, 99]. The robust complex PCA algorithms have also been derived in [100] for hierarchically extracting PCs of complex-valued signals based on a robust statistics-based loss function. Concerning complex MCA, the constrained anti-Hebbian algorithm [73, 77] has been extended for the complex-valued TLS problem [77] and has been applied to adaptive FIR and IIR filtering. The adaptive invariant-norm MCA algorithm [85] has been generalized to the case for complex-valued input signal vector . For ICA algorithms, FastICA has been applied to complex signals [101]. The -APEX algorithms and GHA are, respectively, extended to the complex-valued case [102, 103]. Based on a suitably selected nonlinear function, these algorithms can be used for BSS of complex-valued circular source signals.
12. Other Generalizations of PCA
Simple neural models, described by differential equations, are derived in [104, 105] to calculate the largest and smallest eigenvalues as well as their corresponding eigenvectors of any real symmetric matrix. Supervised PCA [106, 107] is achieved by augmenting the input of the PCA with the class label of the data set. Given a sample covariance matrix, we examine the problem of maximizing the variance explained by a linear combination of the input variables while constraining the number of nonzero coefficients in this combination. This is known as sparse PCA [108]. Constrained PCA, generalized EVD, and the two-dimensional PCA are three important generalizations to PCA.
12.1. Constrained Principal Component Analysis
When certain subspaces are less preferred than others, this yields constrained PCA [109]. The optimality criterion for constrained PCA is variance maximization, as in PCA, but with an external subspace orthogonality constraint that extracted PCs are orthogonal to some undesired subspaces. PCA usually obtains the best fixed-rank approximation to the data in the LS sense. On the other hand, constrained PCA allows specifying metric matrices that modulate the effects of rows and columns of a data matrix. This actually is the weighted LS estimation. Constrained PCA first decomposes the data matrix by projecting the data matrix onto the spaces spanned by matrices of external information and then applies PCA to decomposed matrices, which involves the generalized SVD. The APEX algorithm has been applied to recursively solve the constrained PCA problem [35].
The constrained PAST algorithm [110] is for tracking the signal subspace recursively. Based on an interpretation of the signal subspace as the solution of a constrained minimization task, it guarantees the orthonormality of the estimated signal subspace basis at each update, hence avoiding orthonormalization process. To reduce the computational complexity, fast constrained PAST is introduced which has complexity. A signal subspace rank estimator is employed to track the number of sources.
12.2. Generalized Eigenvalue Decomposition
Generalized EVD is a statistical tool extremely useful in feature extraction, pattern recognition as well as signal estimation and detection. The generalized EVD problem involves the matrix equation where , , and , are, respectively, the th generalized eigenvalue and its corresponding generalized eigenvector. For real symmetric and positive definite matrices, all the generalized eigenvectors are real and the corresponding generalized eigenvalues are positive.
Generalized EVD achieves simultaneous diagonalization of and : where and . Typically, and are, respectively, the full covariance matrices of zero-mean stationary random signals , . In this case, iterative generalized EVD algorithms can be obtained by using two PCA steps. When becomes an identity matrix, the generalized EVD reduces to PCA.
Any generalized eigenvector is a stationary point of the criterion function The LDA problem is a typical generalized EVD problem. The three-layer LDA network [111] is obtained by concatenating two Rubner-Tavan PCA subnetworks, each being trained by the Rubner-Tavan PCA algorithm [13, 49]. Based on the Rubner-Tavan PCA network architecture, online local learning algorithms for LDA and generalized EVD are given in [112]. There are also a number of adaptive methods for generalized EVD such as LDA-based gradient descent [113, 114], a quasi-Newton type generalized EVD [115], an RLS-like fixed-point generalized EVD algorithm [116], error-correction learning [112], and Hebbian learning [112]. All these algorithms first extract the principal generalized eigenvector and then estimate the minor generalized eigenvectors using a deflation procedure.
A recurrent network with invariant -norm [117] computes the largest or smallest generalized eigenvalue and the corresponding eigenvector of any symmetric positive pair, which can be simply extended to compute the second largest or smallest generalized eigenvalue and the corresponding eigenvector. In [118], the proposed unconstrained quartic cost function based on the weighted rule has a unique global minimum, which corresponds to the principal generalized eigenvectors.
12.3. Two-Dimensional PCA
Two-dimensional PCA [5] is especially designed for image representation. An image covariance matrix is constructed directly using the original image matrices instead of the transformed vectors, and its eigenvectors are derived for image feature extraction. For images, the size of the image covariance (scatter) matrix using two-dimensional PCA is , whereas, for PCA, the size is . This results in considerable computational advantage in two-dimensional PCA. Two-dimensional PCA evaluates the covariance matrix more accurately over PCA. When used for face recognition, two-dimensional PCA results in a better recognition accuracy. Two-dimensional PCA is a row-based PCA, and it only reflects the information between rows. Diagonal PCA [119] improves two-dimensional PCA by defining the image scatter matrix as the covariances between the variations of the rows and those of the columns of the images and is shown to be more accurate than PCA and two-dimensional PCA. -norm-based two-dimensional PCA [120] is a iterative two-dimensional generalization of -norm based PCA.
Bidirectional PCA [121] reduces the dimension in both column and row directions for image feature extraction. The feature dimension obtained is much less than that of two-dimensional PCA. Two-dimensional PCA can be regarded as a special bidirectional PCA. Bidirectional PCA has to be performed in batch mode. With the concepts of tensor, -mode unfolding and matricization, an SVD-revision-based incremental learning method of bidirectional PCA [122] gives a close approximation to bidirectional PCA, but using less time.
The uncorrelated multilinear PCA algorithm [123] is used for unsupervised subspace learning of tensorial data. It is a multilinear extension of PCA. Through successive variance maximization, uncorrelated multilinear PCA seeks a tensor-to-vector projection that captures most of the variation in the original tensorial input while producing uncorrelated features. It is the only multilinear extension of PCA that can produce uncorrelated features in a fashion similar to that of PCA, in contrast to other multilinear PCA extensions, such as two-dimensional PCA [5] and multilinear PCA (MPCA) [124].
13. Singular Value Decomposition
Given two sets of random vectors with zero mean, and , the cross-correlation matrix is defined by where is the th singular value, and are its corresponding left and right singular vectors, and . The cross-correlation asymmetric PCA/MCA networks can be used to extract the singular values of the cross-correlation matrix of two stochastic signal vectors or to implement SVD of a general matrix.
Cross-correlation asymmetric PCA (APCA) network consists of two sets of neurons that are laterally hierarchically connected [125]. The topology of the network is shown in Figure 3. and are, respectively, the -dimensional and -dimensional input signals, the matrix and the matrix are the feedforward weights, while the matrices and are the lateral connection weights, where , , and . This model performs SVD of [125].
The network has the following relations: where and .
The objective function for extracting the first principal singular value of the covariance matrix is given by It is an indefinite function. When , it reduces to PCA [14]. After the principal singular component is extracted, a deflation transformation is introduced to nullify the principal singular value so as to make the next singular value principal. Thus, in the criterion (27) can be replaced by a transformed form so as to extract the next principal singular component.
Using a deflation transformation, the two sets of neurons are trained with the cross-coupled Hebbian learning rules, which are given in [125, 126]. and approximate the th left and right principal singular vectors of , respectively, and approximates its corresponding criterion , as . The algorithm extracts the first principal singular values in the descending order and their corresponding left and right singular vectors. Like APEX, the APCA algorithm incrementally adds nodes without retraining the learned nodes. Exponential convergence has been demonstrated by simulation [125].
Based on the APCA network, the principal singular component of can be efficiently extracted by using a modification to the cross-coupled Hebbian rule with global asymptotic convergence [127]. This algorithm is extended for extracting the principal singular component of a general matrix by replacing the cross-correlation matrix by a general nonsquare matrix [127]. Based on this and a deflation transformation, one can extract multiple principal singular components for the nonsquare matrix [127]. The algorithm can efficiently perform SVD of an ill-posed matrix. It can be used to solve the smallest singular component of the general matrix and is especially useful for TLS problems. Some adaptive SVD algorithms for subspace tracking of a recursively updated data matrix have been surveyed and proposed in [128].
Coupled learning rules for SVD produce better results than Hebbian learning rules. Combined with first-order approximation of GSO, precise estimates of singular vectors and singular values with only small deviations from orthonormality are produced. Double deflation is clearly superior to standard deflation but inferior to first-order approximation of GSO, both with respect to orthonormality and diagonalization errors. Coupled learning rules converge faster than Hebbian learning rules, and the first-order approximation of GSO produces more precise estimates and better orthonormality than standard deflation [129]. Many SVD algorithms are reviewed in [129].
Tucker decomposition [130] decomposes a three-dimensional signal directly using three-dimensional PCA, which is a multilinear generalization of SVD to multidimensional data. For video frames, this higher-order SVD decomposes the dynamic texture as a multidimensional signal (tensor) without unfolding the video frames on column vectors. This is a more natural and flexible decomposition, since it permits us to perform dimension reduction in the spatial, temporal, and chromatic domains between the pixels of the video sequence, leading to an important decrease in model size, while standard SVD allows for temporal reduction only. The analysis part is more expensive, but the synthesis has the same cost as existing algorithms [131].
14. Canonical Correlation Analysis
CCA [132], proposed by Hotelling in 1936, is a multivariate statistical technique. It makes use of two views of the same set of objects and projects them onto a lower-dimensional space in which they are maximally correlated. CCA seeks prominently correlated projections between two views of data, and it has been long known to be equivalent to LDA when the data features are used in one view and the class labels are used in the other view [133, 134]. In other words, LDA is a special case of CCA. CCA is equivalent to LDA for binary-class problems [134], and it can be formulated as an LS problem for binary-class problems.
CCA leads to a generalized EVD problem. Thus, we can employ a kernelized version of CCA to compute a flexible contrast function for ICA. Generalized CCA consists of a generalization of CCA to more than two sets of variables [135].
Given two centered random multivariables and , the goal of CCA is to find a pair of directions and such that the correlation between the two projections and is maximized.
Suppose that we are given a sample of instances of . Let denote and similarly denote . We can consider defining a new coordinate for by choosing direction and projecting onto that direction, . If we do the same for by choosing direction , we obtain a sample of the new mapping for . Let with the corresponding values of the mapping for being The first stage of canonical correlation is to choose and to maximize the correlation between the two vectors
After manipulation, we have where the covariance matrix of is defined by
Under a mild condition which tends to hold for high-dimensional data, CCA in the multilabel case can be formulated as an LS problem [136]. Based on this, efficient algorithms for solving LS problems can be applied to scale CCA to very large data sets. In addition, several CCA extensions, including the sparse CCA formulation based on -norm regularization, are proposed [136]. The LS formulation of CCA and its extensions can be solved efficiently. The LS formulation is extended to orthonormalized partial least squares by establishing the equivalence relationship between CCA and orthonormalized partial least squares [136]. The CCA projection for one set of variables is independent of the regularization on the other set of multidimensional variables.
In [137], a strategy for reducing LDA to CCA is proposed. Within-class coupling CCA (WCCCA) is to apply CCA to pairs of data samples that are most likely to belong to the same class. Each one of the samples of a class, serving as the first view, is paired with every other samples of that class serving as the second view. The equivalence between LDA and such an application of CCA is proved.
Two-dimensional CCA seeks linear correlation based on images directly. Motivated by locality-preserving CCA [138] and spectral clustering, a manifold learning method called local two-dimensional CCA [139] identifies the local correlation by weighting images differently according to their closeness. That is, the correlation is measured locally, which makes local two-dimensional CCA more accurate in finding correlative information. Local two-dimensional CCA is formulated as solving generalized eigenvalue equations tuned by Laplacian matrices.
CCArc [140] is a two-dimensional CCA that is based on representing the image as the sets of its rows and columns and implementation of CCA using these sets. CCArc does not require preliminary downsampling procedure, it is not iterative and it is applied along the rows and columns of input image. Size of covariance matrices in CCArc is equal to Small-sample-size problem in CCArc does not occur, because we actually use images of size and images of size ; this always meets the condition .
15. A Simulation Example
The concept of subspace is involved in many information processing problems. This requires EVD of the autocorrelation matrix of a data set or SVD of the cross-correlation matrix of two data sets. For example, in the area of array signal processing, the APCA algorithm is used in beamforming [126] and the MCA or PCA method is used in DoA estimation [141]. We have also illustrated weighted SLA, GHA, and APEX in [10]. We now provide an example to illustrate the application of PCA.
Image compression is usually implemented by partitioning an image into many nonoverlapping pixel blocks and then compressing them one by one. Based on the statistics of all the regions, one can use PCA to compress the image [8, 10]. Each region is concatenated into a vector, and all the vectors constitute a training set. PCA is then applied to extract those prominent PCs, as such the image is compressed. Similar results for image compression have been reported in [11, 142] by using a three-layer autoassociative network with BP learning. PCA as well as LDA achieves the same results for an original data set and its orthonormally transformed version [143]; thus, both methods can be directly implemented in the DCT domain, and the results are exactly the same as that obtained from the spatial domain.
In this example, we use the APEX algorithm to train the PCA network and then use the trained network to encode other pictures. We use the benchmark Lina picture of pixels as the training set, and a kid picture of , which has the similar statistics, is then used for generalization. We first use blocks. For each square, only the first PC is significant, and all the other PCs can be ignored in terms of the quality of the restored picture. In this sense, we achieve a compression ratio of 1 : 64. However, the quality of the restored image is very poor. This is because PCA assumes the Guassian statistics of the data set and a real picture usually does not satisfy this assumption.
To improve the quality of the restored image, we employ nonoverlapping blocks and again use the first PCs of the blocks. This achieves a compression ratio of 1 : 16. The Lena picture is thus transformed into a training set of 14400 vectors, and the kid picture is transformed into a validation set of 19200 vectors. The training set is so large that the APEX algorithm converges after only two epochs. The obtained codebook is = (0.1949, 0.2222, 0.1920, 0.1747, 0.2255, 0.2065, 0.2195, 0.2246, 0.1972, 0.1701, 0.2268, 0.1908, 0.2224, 0.1996, 0.2140, 0.1783). The Lena picture and its restored version are shown in Figure 4. The trained network is then used to compress the kid picture. Both the kid picture and its restored version are shown in Figure 5.
16. Summary
In this paper, we have discussed various neural network implementations and algorithms for PCA and its various extensions, including PCA, MCA, generalized EVD, constrained PCA, two-dimensional methods, localized methods, complex-domain methods, and SVD. These neural network methods provide an advantage over their conventional counterparts in that they are adaptive algorithms and have low computational as well as low storage complexity. These neural network methods find wide applications in pattern recognition, blind source separation, adaptive signal processing, and information compression. Two methods that are strongly associated with PCA, namely, ICA and LDA, are described here in passing.
16.1. Independent Component Analysis
ICA [144] is a statistical model that extends PCA. ICA has been widely used for BSS, feature extraction, and signal detection. For BSS applications, the ICA model is required to have model identifiability and separability [144]. The first neural network model related to the ICA was developed for online BSS of linearly mixed signals in [145].
ICA can extract the statistically independent components from the input data set. It is to estimate the mutual information between the signals by adjusting the estimated matrix to give outputs that are maximally independent. By applying ICA to estimate the independent input data from raw data, a statistical test can be derived to reduce the input dimension. The dimensions to remove are those that are independent of the output. In contrast, in PCA the dimensionality reduction is achieved by removing those dimensions that have a low variance.
Let a -vector denote a linear mixture and a -vector , whose components have zero mean and are statistically mutually independent, denote the original source signals. The ICA data model can be written as where is an unknown constant full-rank mixing matrix and denotes the additive noise term, which is often omitted since it is usually impossible to separate noise from the sources. ICA takes one of three forms, namely, square ICA for , overcomplete ICA for , and undercomplete ICA for . While undercomplete ICA is useful for feature extraction, overcomplete ICA may be applied to signal and image processing methods based on multiscale and redundant basis sets.
The goal of ICA is to estimate by such that the components of , which is the estimate of , are statistically as independent as possible, being a demixing matrix. The statistical independence property implies that the joint probability density of the components of equals the product of the marginal densities of the individual components. Each component of is a stationary stochastic process, and only one of the components is allowed to be Gaussian distributed. The higher-order statistics of the original inputs are required for estimating , rather than the second-order moment or covariance of the samples as used in PCA.
The demixing matrix of ICA is not orthogonal, while in PCA the components of the weights are represented on an orthonormal basis. ICA provides in many cases a more meaningful representation of the data than PCA. ICA can be realized by adding nonlinearity to linear PCA networks such that they are able to improve the independence of their outputs. In [146], an efficient ICA algorithm is derived by minimizing a nonlinear PCA criterion using the RLS approach. The three-layer (--) linear autoassociative network can also be used as an ICA network, as long as the outputs of the hidden layer are independent.
A well-known two-phase approach to ICA is to preprocess the data by PCA and then to estimate the necessary rotation matrix. A generic approach to ICA consists of preprocessing the data, defining measures of non-Gaussianity, and optimizing an objective function, known as a contrast function. Some measures of non-Gaussianity are kurtosis, differential entropy, negentropy, and mutual information, which can be derived from one another. Popular ICA algorithms include the Infomax, the natural-gradient, the equivariant adaptive separation via independence (EASI), and the FastICA algorithms [10]. FastICA is a well-known fixed-point ICA algorithm [147]. It is derived from the optimization of the kurtosis or the negentropy measure by using Newton’s method. FastICA achieves a reliable and at least a quadratic convergence. FastICA with symmetric orthogonalization and tanh nonlinearity is concluded as the best trade-off for ICA [10].
The ICA methods can be easily extended to the complex domain by using Hermitian transpose and complex nonlinear functions. In the context of BSS, the higher-order statistics are necessary only for temporally uncorrelated stationary sources. Second-order statistics-based source separation exploits temporally correlated stationary sources and the nonstationarity of the sources [148]. Many natural signals are inherently nonstationary with time-varying variances, since the source signals incorporate time delays into the basic BSS model.
Blind separation of the original signals in nonlinear mixtures has many difficulties such as the intrinsic indeterminacy, the unknown distribution of the sources as well as the mixing conditions, and the presence of noise. It is impossible to separate the original sources using only the source independence assumption of some unknown nonlinear transformations of the sources [149]. Nonlinear ICA can be modeled by a parameterized neural network whose parameters can be determined under the criterion of independence of its outputs. The inverse of the nonlinear mixing model can be modeled by using the three-layer MLP, the RBFN, the SOM, or the kernel-based nonlinear BSS method [10].
Nonnegativity is a natural condition for many real-world applications, for example, in the analysis of images, text, or air quality. Neural networks can be suggested by imposing a nonnegativity constraint on the outputs [29] or weights. Nonnegative PCA and nonnegative ICA algorithms are given in [150], where the sources must be nonnegative. Constrained ICA is a framework that incorporates additional requirements and prior information in form of constraints into the ICA contrast function [151].
16.2. Linear Discriminant Analysis
LDA creates a linear combination of the given independent features that yield the largest mean differences between the desired classes [2]. Given a set of vectors of dimensions, , for all the samples of all the classes, the within-class scatter matrix , the between-class scatter matrix , and the mixture scatter matrix are defined. All the scatter matrices are of size . The minimization of the MSE criterion is equivalent to the minimization of the trace of or maximizing the trace of .
The objective for LDA is to maximize the between-class measure while minimizing the within-class measure after applying a transform matrix , , which transforms the scatter matrices into matrices , , : The measures the closeness of the samples within the clusters, and measures the separation between the clusters, where denotes the trace operator. An optimal should preserve the given cluster structure, and simultaneously maximize and minimize . Assuming that is a nonsingular matrix, conventionally, the following Fisher’s determinant ratio criterion is maximized for finding the projection directions [152, 153]: where the column vectors , , of are the first principal eigenvectors of . Under the assumption that the class distributions are identically distributed Gaussians, LDA is Bayes optimal [3].
There are at most nonzero generalized eigenvalues and thus an upper bound on is ; at least samples are needed to guarantee to be nonsingular. This requirement on the number of samples may be severe for some problems like image processing. In this case, will be singular and regularization may be necessary. By introducing kernel into , nonlinear discriminant analysis is obtained [3, 154]. A multiple of the identity or the kernel matrix can be added to or its reformulated matrix after introducing the kernels to penalize or , respectively, [154]. The high-dimensional features can also be first compressed by PCA into intermediate-dimensional space, which is further projected by LDA onto the low-dimensional space. The overall performance of the two-stage approach is sensitive to the reduced dimension in the first stage. A generalization of LDA by using generalized SVD [153] can be used to solve the problem of singularity of . For image feature extraction, two-dimensional LDA algorithms [155, 156] provide an efficient approach and can overcome the singularity problem.
In [152], a nonlinear discriminant analysis network with the MLP as the architecture and Fisher’s determinant ratio as the criterion function is obtained by combining the universal approximation properties of the MLP with the target-free nature of LDA. A layered lateral network-based LDA network and an MLP-based nonlinear discriminant analysis network are also proposed in [111]. Based on a single-layer linear feedforward network, LDA algorithms are also given in [112, 113].