#### Abstract

We present here an assessment of the latent market information embedded in the raw, affinity (normalized), and partial correlations. We compared the Zipf plot, spectrum, and distribution of the eigenvalues for each matrix with the results of the corresponding random matrix. The analysis was performed on stocks belonging to the New York and Tel Aviv Stock Exchanges, for the time period of January 2000 to March 2009. Our results show that in comparison to the raw correlations, the affinity matrices highlight the dominant factors of the system, and the partial correlation matrices contain more information. We propose that significant stock market information, which cannot be captured by the raw correlations, is embedded in the affinity and partial correlations. Our results further demonstrate the differences between the NY and TA markets.

#### 1. Introduction

Stock markets behave as complex dynamic systems, and as such, it is critical to investigate the dependencies (interactions) between the dynamics of the system variables (stocks, bonds, etc.). It is common to associate such interactions with the notion of correlation, or similarity. Indeed, much effort is dedicated to studying and understanding such stock cross-correlations [1, 2] in an attempt to extract the maximum latent market information that is embedded in the interactions between the market variables.

Extraction of relevant information from the empirical cross-correlations between stocks is difficult, as has been noted by Plerou et al. [3, 4]. To discern to what extent the empirical stock correlation coefficients are a result of random noise rather than important system information, they proposed to make use of Random Matrix Theory (RMT). RMT is a tool that was originally developed and used in nuclear physics by Wigner and Dirac [5], Dyson [6], and Mehta [7] in order to explain the statistics of the energy levels of complex quantum systems. Recently, it has also been applied to studies of economic data [3, 4, 8–20].

Whereas the main focus of RMT analysis of stock data has been on the analysis of stock cross-correlation matrices, here we assess the importance of RMT investigations for two alternative similarity measures. The first is a measure of normalized correlations, computed by a collective normalization of the correlation matrix, which has been termed the affinity transformation [2, 21, 22]. The second is a measure of partial correlations [23], which provides a way to study the residual stock correlations, after removing the mediating effect of the index. This measure was chosen following our previous work [2], where we used it to show the influence of the index on stock correlations.

Here we analyze financial time series from the New York stock market (S&P500, available at [24]) and the Tel-Aviv stock market (TASE, available at [25]). For the S&P stock data, we studied a set of 445 stocks, for the trading time period of 03/01/2000 to 15/03/2009. In addition, we also investigated the Dow Jones Industrial Average (DJIA) index and its constituent stocks [24], which are also members of the S&P500 index. For the TA data, we only used stocks with substantial capital and substantial daily turnover for the same time period. For each stock, we use the daily adjusted closing price to compute the stock return, using the standard transformation

$$r_i(t) = \log[P_i(t)] - \log[P_i(t - \Delta t)],$$

where $P_i(t)$ is the daily adjusted closing price of stock $i$, and the time interval $\Delta t$ was taken to be equal to 1 trading day. The result of the screening process [2] is 445 stocks and the S&P500 index for the S&P dataset, 30 stocks and the DJIA index for the DJIA dataset, and 26 stocks and the TA25 index for the TA dataset.
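The return computation can be illustrated with a minimal sketch, assuming the standard log-return transformation. The function name `log_returns` and the toy prices are hypothetical and are not taken from the datasets analyzed here.

```python
import numpy as np

def log_returns(prices, dt=1):
    """Log returns r(t) = log P(t) - log P(t - dt), one column per stock."""
    log_p = np.log(np.asarray(prices, dtype=float))
    return log_p[dt:] - log_p[:-dt]

# Toy example: 4 trading days, 2 hypothetical stocks (made-up prices).
prices = np.array([[100.0, 50.0],
                   [101.0, 49.0],
                   [103.0, 49.5],
                   [102.0, 51.0]])
returns = log_returns(prices)
print(returns.shape)  # (3, 2): one fewer row than the price series
```

With `dt=1` each row of `returns` corresponds to one trading day, which matches the 1-day interval used in the text.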

The comparison between the different similarity measures is performed employing the commonly used RMT. The distribution of the eigenvalues of each matrix is evaluated and compared to the expected distribution of eigenvalues of the corresponding random matrix (the original matrix after random shuffling of the elements, also known as the Wishart matrix [3]).

#### 2. Stock Correlation Metrics

The three similarity measures we use in this work are: (1) raw correlations $C(i,j)$: the Pearson pairwise correlation between stocks $i$ and $j$; (2) affinity correlations $A(i,j)$: the normalized correlation between stocks $i$ and $j$, according to the correlation of each one with all other stocks, computed by the affinity collective normalization [2, 21, 22]; (3) partial correlations $\rho(i,j \mid m)$: the residual correlation between stocks $i$ and $j$, after removing the mediating effect of a stock $m$.

##### 2.1. Raw Correlation Matrices

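The stock raw correlations are calculated using the Pearson correlation coefficient

$$C(i,j) = \frac{\langle (r_i - \langle r_i \rangle)(r_j - \langle r_j \rangle) \rangle}{\sigma_i \sigma_j},$$

where $r_i$ and $r_j$ are the returns of stocks $i$ and $j$, $\langle r_i \rangle$ and $\langle r_j \rangle$ denote the corresponding means, $\sigma_i$ and $\sigma_j$ are the corresponding standard deviations (STDs), and $\langle \cdot \rangle$ denotes the average over time. Note that $C$ is a symmetric square matrix and $C(i,i) = 1$ for all $i$.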
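As an illustration, the Pearson correlation matrix can be computed from a return array as in the sketch below. The synthetic Gaussian returns and the function name `raw_correlation` are assumptions made for the example only.

```python
import numpy as np

def raw_correlation(returns):
    """Pearson correlation matrix of the columns of a (T, N) return array."""
    r = np.asarray(returns, dtype=float)
    z = (r - r.mean(axis=0)) / r.std(axis=0)  # standardize each stock
    return z.T @ z / r.shape[0]               # time average of the products

rng = np.random.default_rng(0)
returns = rng.normal(size=(500, 5))           # synthetic data, 5 stocks
C = raw_correlation(returns)
```

The result agrees with `np.corrcoef(returns, rowvar=False)`; the explicit form is shown only to mirror the definition.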

##### 2.2. Normalized Correlation (Affinity) Matrices

The matrices are normalized using the affinity transformation, a special collective normalization procedure first proposed by Baruchi et al. [21, 22]. The idea is to normalize the correlations between each pair of stocks according to the correlations of each of the two stocks with all other stocks. This process is in fact a calculation of the correlation of correlations, or metacorrelation. The metacorrelations $MC(i,j)$ are the Pearson correlations between rows $i$ and $j$ in the correlation matrix after reordering. In the reordering process, the elements $C(i,i)$ and $C(j,j)$ are taken out. The correlation vector for $i$ is $\{C(i,j), C(i,1), C(i,2), \ldots\}$ and for $j$ it is $\{C(j,i), C(j,1), C(j,2), \ldots\}$. In other words, the metacorrelation $MC(i,j)$ is a measure of the similarity between the correlations of stock $i$ with all other stocks and the correlations of stock $j$ with all other stocks. Using the metacorrelations, the normalized correlations are

$$A(i,j) = \sqrt{C(i,j) \cdot MC(i,j)}.$$

The affinity transformation process emphasizes subgroups of variables (stocks) in the system by removing the effect of the background noise of correlation. Groups (clusters) identified in the affinity matrix are significant in the system and warrant further investigation. We demonstrate the strength of the affinity transformation in Figure 1, where we compare the S&P500 dataset correlation matrix to its affinity matrix. Both matrices are ordered similarly, and the groups weakly visible in the correlation matrix (Figure 1(a)) are emphasized and highlighted by the affinity transformation process (Figure 1(b)).
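The metacorrelation step can be sketched as follows. This is an illustration under stated assumptions rather than the reference implementation of [21, 22]: the function names are hypothetical, the data are synthetic, and the final combination $A = \sqrt{C \cdot MC}$ is assumed, with pairs whose product is negative left as NaN.

```python
import numpy as np

def metacorrelation(C):
    """MC(i, j): Pearson correlation between rows i and j of C,
    with the self-correlations C(i, i) and C(j, j) left out."""
    C = np.asarray(C, dtype=float)
    n = C.shape[0]
    MC = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            others = [k for k in range(n) if k != i and k != j]
            # Reordered vectors: C(i, j) is paired with C(j, i),
            # and C(i, k) with C(j, k) for every other stock k.
            vi = np.concatenate(([C[i, j]], C[i, others]))
            vj = np.concatenate(([C[j, i]], C[j, others]))
            MC[i, j] = MC[j, i] = np.corrcoef(vi, vj)[0, 1]
    return MC

def affinity(C):
    """Assumed combination A = sqrt(C * MC); negative products become NaN."""
    C = np.asarray(C, dtype=float)
    with np.errstate(invalid="ignore"):
        return np.sqrt(C * metacorrelation(C))

rng = np.random.default_rng(1)
market = rng.normal(size=(400, 1))                  # common synthetic factor
returns = market + 0.8 * rng.normal(size=(400, 6))  # six synthetic stocks
C = np.corrcoef(returns, rowvar=False)
MC = metacorrelation(C)
A = affinity(C)
```

The double loop costs on the order of $N^3$ operations, which is acceptable for a few hundred stocks; a vectorized version would be preferable for larger portfolios.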

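**Figure 1:** Panels (a)–(d). (a) Raw correlation matrix and (b) affinity matrix of the S&P dataset; the figure also shows the partial correlation matrix and the corresponding random matrix.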

##### 2.3. Partial Correlation Matrices

Recently [2], the concept of partial correlation has been applied in the study of financial data. A partial correlation is a statistical measure indicating how a third mediating variable affects the similarity between two variables [23]. The partial correlation coefficient is given by

$$\rho(i,j \mid m) = \frac{C(i,j) - C(i,m)\,C(j,m)}{\sqrt{\left(1 - C^2(i,m)\right)\left(1 - C^2(j,m)\right)}},$$

where $C(i,j)$ is the pairwise correlation between stocks $i$ and $j$, $C(i,m)$ is the pairwise correlation between stock $i$ and the index $m$, and $C(j,m)$ is the pairwise correlation between stock $j$ and the index.

As we have recently reported [2], computing the partial correlations between stocks after removing the mediating effect of the index reveals the naked structure of the market. Thus, here we will repeat this process and compute the stock partial correlation matrices for the three datasets, using the index as the mediating variable.
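A minimal sketch of the partial correlation computation, using the standard first-order partial correlation formula with the index as the mediating variable. The synthetic returns, the synthetic index, and the function name are assumptions for illustration only.

```python
import numpy as np

def partial_correlation(C, c_index):
    """rho(i, j | m) from stock-stock correlations C (N, N) and
    stock-index correlations c_index (N,)."""
    C = np.asarray(C, dtype=float)
    c = np.asarray(c_index, dtype=float)
    numerator = C - np.outer(c, c)
    denominator = np.sqrt(np.outer(1.0 - c**2, 1.0 - c**2))
    return numerator / denominator

rng = np.random.default_rng(2)
T, N = 600, 5
market = rng.normal(size=T)                               # synthetic market factor
returns = market[:, None] + 0.7 * rng.normal(size=(T, N))
index = market + 0.3 * rng.normal(size=T)                 # synthetic index

full = np.corrcoef(np.column_stack([returns, index]), rowvar=False)
C, c = full[:N, :N], full[:N, N]
P = partial_correlation(C, c)
```

The same matrix is obtained by regressing each stock's returns on the index and correlating the residuals, which is a convenient consistency check.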

#### 3. Random Matrix Theory (RMT)

Using RMT, we study whether there are eigenvalues that deviate from the range of the random ones (the eigenvalues of the corresponding random matrix). The eigenvalues of the random matrix represent the limit of zero-embedded information (maximum entropy), hence the deviation from this range provides a measure for relevant latent information about the market organization that is embedded in the correlation matrices.

##### 3.1. Eigenvalues of Random Matrix

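Let us consider a portfolio of $N$ stocks, each with $T$ time records. If these time series are uncorrelated, then the resulting random matrix (also known as the Wishart matrix [3]) has eigenvalues that follow the distribution

$$\rho(\lambda) = \frac{Q}{2\pi\sigma^2}\,\frac{\sqrt{(\lambda_{\max} - \lambda)(\lambda - \lambda_{\min})}}{\lambda}, \tag{3.1}$$

with $N \to \infty$, $T \to \infty$, and $Q = T/N \geq 1$. This distribution is bounded by

$$\lambda_{\min}^{\max} = \sigma^2\left(1 + \frac{1}{Q} \pm 2\sqrt{\frac{1}{Q}}\right). \tag{3.2}$$

The spectral density is different from zero in the interval $\lambda_{\min} < \lambda < \lambda_{\max}$, and in the case of correlation matrices, $\sigma^2 = 1$ (other values for $\sigma^2$ have been proposed in the past, see, e.g., [11, 13]). Using the distribution presented in (3.1), we study the eigenvalue distribution for the different sets of stocks.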
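For illustration, the band edges and the density of the random-matrix eigenvalues can be evaluated as in the sketch below, which assumes the standard Marchenko–Pastur form with $Q = T/N$. The function names and the sizes in the example are hypothetical and are not those of the datasets studied here.

```python
import numpy as np

def mp_bounds(n_stocks, n_records, sigma2=1.0):
    """Edges (lambda_min, lambda_max) of the random-matrix eigenvalue band."""
    q = n_records / n_stocks
    shift = 2.0 * np.sqrt(1.0 / q)
    return sigma2 * (1.0 + 1.0 / q - shift), sigma2 * (1.0 + 1.0 / q + shift)

def mp_density(lam, n_stocks, n_records, sigma2=1.0):
    """Eigenvalue density of the random matrix; zero outside the band."""
    lam = np.atleast_1d(np.asarray(lam, dtype=float))
    q = n_records / n_stocks
    lam_min, lam_max = mp_bounds(n_stocks, n_records, sigma2)
    rho = np.zeros_like(lam)
    inside = (lam > lam_min) & (lam < lam_max)
    x = lam[inside]
    rho[inside] = (q / (2.0 * np.pi * sigma2)
                   * np.sqrt((lam_max - x) * (x - lam_min)) / x)
    return rho

# Hypothetical sizes chosen for round numbers (Q = 4).
lam_min, lam_max = mp_bounds(n_stocks=100, n_records=400)
grid = np.linspace(lam_min, lam_max, 20001)
area = np.sum(mp_density(grid, 100, 400)) * (grid[1] - grid[0])
print(lam_min, lam_max)  # 0.25 2.25
```

The numerical integral `area` is close to 1, as expected for a normalized density when $Q \geq 1$.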

##### 3.2. Assessment of the Embedded Information

We compare the eigenvalue statistics and scaling of the raw, affinity, and partial correlation matrices to those of the corresponding random matrices. It has been proposed in the past by Plerou et al. [3, 4], Laloux et al. [14], Garas and Argyrakis [11], and others that eigenvalues deviating from the eigenvalue regime of the random matrix, that is, $\lambda_{\min} \leq \lambda \leq \lambda_{\max}$, contain the relevant economic information stored in the correlation matrices.
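A minimal sketch of this comparison, assuming $\sigma^2 = 1$ and synthetic returns driven by a single common factor (both are assumptions of the example, not properties of the datasets studied here):

```python
import numpy as np

rng = np.random.default_rng(3)
T, N = 1000, 50
market = rng.normal(size=(T, 1))                   # synthetic common factor
returns = 0.5 * market + rng.normal(size=(T, N))   # synthetic stock returns

C = np.corrcoef(returns, rowvar=False)
eigvals = np.linalg.eigvalsh(C)                    # real eigenvalues, ascending

# Random-matrix band for Q = T / N with sigma^2 = 1.
Q = T / N
lam_min = 1.0 + 1.0 / Q - 2.0 * np.sqrt(1.0 / Q)
lam_max = 1.0 + 1.0 / Q + 2.0 * np.sqrt(1.0 / Q)

n_above = int(np.sum(eigvals > lam_max))           # deviating from above
n_below = int(np.sum(eigvals < lam_min))           # deviating from below
print(n_above, n_below)
```

In this toy setting the common factor produces a principal eigenvalue far above `lam_max`, which plays the role of the market mode.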

##### 3.3. Zipf Plot of the Eigenvalues and Power Law Scaling

While inspection of the eigenvalue distribution was found to be useful, it cannot be used for the case of small markets, and it does not always capture important information embedded in the scaling behavior of the leading eigenvalues. To overcome these difficulties, we investigated the scaling of the eigenvalues. Following the notion of a Zipf law that was found in text [26, 27], we simply plot the values of the ordered eigenvalues as a function of their rank using a log-log scale. This simple presentation was found to be quite powerful in distinguishing between the different similarity measures and revealed a power-law scaling behavior for the correlation matrices.
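The slope of such a Zipf plot can be estimated with a least-squares fit in log-log coordinates, as in the sketch below. The function name, the choice of fitting range, and the exact power-law test data are assumptions for illustration; the eigenvalues entering the fit must be positive.

```python
import numpy as np

def zipf_slope(eigvals, n_ranks):
    """Slope of log10(eigenvalue) versus log10(rank) over the top n_ranks."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # descending order
    rank = np.arange(1, n_ranks + 1)
    slope, _intercept = np.polyfit(np.log10(rank), np.log10(lam[:n_ranks]), 1)
    return slope

# Synthetic eigenvalues following an exact power law with exponent -0.7.
ranks = np.arange(1, 201)
toy_eigvals = 10.0 * ranks ** -0.7
slope = zipf_slope(toy_eigvals, n_ranks=100)
```

On real eigenvalues the fit should be restricted to the ranks over which the log-log plot is approximately linear.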

##### 3.4. Eigenvalue Spectrum

To further study the scaling behavior of the eigenvalues in this context, we first order the eigenvalues such that $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_N$ and study the scaling of the “frequency” spectrum of the eigenvalues, defined by

$$\omega_i = \lambda_i - \lambda_{i+1}. \tag{3.3}$$

Once we compute the frequency spectrum of the eigenvalues, $\omega_i$, we study the distribution of this spectrum.
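As a sketch, and assuming that the frequency spectrum is the set of gaps between consecutive ordered eigenvalues (an assumption made by analogy with energy level gaps), the spectrum can be computed as follows. The function name and the toy eigenvalues are hypothetical.

```python
import numpy as np

def eigenvalue_spectrum(eigvals):
    """Gaps between consecutive eigenvalues sorted in descending order.
    Assumed definition: omega_i = lambda_i - lambda_(i+1)."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    return lam[:-1] - lam[1:]

omega = eigenvalue_spectrum([2.0, 5.0, 3.0, 2.0])
print(omega)  # [2. 1. 0.]
```

For $N$ eigenvalues this yields $N - 1$ non-negative values.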

#### 4. Results

##### 4.1. Stock Similarity Matrices

We begin by calculating the correlation, affinity, and partial correlation matrices for each of the three datasets. Then, to create the random matrices, we shuffled the stock return data for each of the three datasets separately. We repeated this shuffling 10 times and averaged the results into one single random matrix. The $Q$ value and the maximal and minimal eigenvalues computed for these shuffled matrices are summarized in Table 1.
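One way to build such a shuffled reference matrix is sketched below. The details of the shuffling are an assumption: here each stock's return series is permuted in time independently of the others, which destroys the cross-correlations while keeping each stock's own return distribution. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def shuffled_correlation(returns, n_shuffles=10, seed=0):
    """Average correlation matrix over independent time-shuffles of each stock."""
    rng = np.random.default_rng(seed)
    r = np.asarray(returns, dtype=float)
    total = np.zeros((r.shape[1], r.shape[1]))
    for _ in range(n_shuffles):
        # permuted(..., axis=0) shuffles every column (stock) independently.
        shuffled = rng.permuted(r, axis=0)
        total += np.corrcoef(shuffled, rowvar=False)
    return total / n_shuffles

rng = np.random.default_rng(5)
market = rng.normal(size=(1000, 1))                   # synthetic common factor
returns = market + 0.5 * rng.normal(size=(1000, 8))   # strongly correlated stocks
raw = np.corrcoef(returns, rowvar=False)
R = shuffled_correlation(returns)
```

In this example the off-diagonal entries of `raw` are large, while those of `R` are close to zero.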

An example of the different correlation matrices for a representative set of 147 stocks of the S&P data, and the corresponding random matrix, is presented in Figure 1. Throughout this section, we will demonstrate our results using the S&P dataset.

##### 4.2. The Eigenvalue Distributions

For each similarity matrix we compute its eigenvalues and then investigate their distribution. In Figure 2 we plot the probability distribution function (PDF) of the eigenvalues for the three similarity matrices (for the S&P dataset). We compare each PDF to the corresponding distribution of eigenvalues of the random matrix. The results obtained for the raw correlations are similar to those reported by Plerou et al. [3], who analyzed a similar set of stocks.

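**Figure 2:** PDF of the eigenvalues for the S&P dataset, each compared with that of the corresponding random matrix: (a) raw correlations, (b) affinity correlations, (c) partial correlations.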

First, we compare the eigenvalues of the raw and affinity correlations (Figures 2(a) and 2(b)). We find that for both similarity measures, the first eigenvalue (the principal eigenvalue) is significantly larger than the rest (10 times larger in both cases). The principal eigenvalue is commonly associated with the market [3, 4, 11]. Comparing these two sets of eigenvalues, we note that the normalization process used to compute the affinity matrices results in a much larger principal eigenvalue for the affinity matrix than for the correlation matrix. Furthermore, in the case of the affinity matrices, there are fewer deviating eigenvalues above $\lambda_{\max}$, for all three cases (Table 2).

Next, we turn to study the eigenvalues of the partial correlation matrices (Figure 2(c)). Studying this distribution, it is possible to observe that there are more eigenvalues that deviate from the noise, especially ones that are larger than $\lambda_{\max}$ (Table 2). Another immediate observation is that the principal eigenvalue is now significantly smaller than in the case of the correlation and affinity matrices. However, it is important to keep in mind that in this case the principal eigenvalue is no longer associated with the market, since the effect of the index (which is a representative of the market [2]) has been removed.

For the DJIA and TA datasets, we focus on the number of eigenvalues deviating from $\lambda_{\max}$ of the corresponding random matrix (for each dataset separately), for the raw correlation, affinity, and partial correlation matrices (Table 2). Here again we observe more deviating eigenvalues in the case of the partial correlations. For the raw correlation matrix, there are 3 and 2 such eigenvalues for the DJIA and TA datasets, respectively, and 2 and 1 for the affinity matrices. These include the principal eigenvalue. In contrast, in the case of the partial correlations, there are 6 and 5 such eigenvalues for the DJIA and TA datasets, respectively. This is in agreement with the results of the S&P dataset, where we observed more eigenvalues that deviate above $\lambda_{\max}$ for the case of the partial correlation matrix.

##### 4.3. Scaling Behavior of Stock Similarity Measures

To further investigate the eigenvalues of the different similarity matrices, we order the eigenvalues by rank and investigate whether they follow a Zipf-law-like behavior [26, 27].

To test if the eigenvalue distributions display a Zipf Law behavior, we calculate the slope of each distribution presented in Figure 3. For the first 100 eigenvalues, we see a power law-scaling behavior, which disappears for smaller eigenvalues. We find that the slope equals , and for the raw correlation, affinity, partial correlation, and random matrices, respectively.

##### 4.4. Eigenvalue Spectra

Comparison of the eigenvalue spectra, as defined in (3.3), is presented in Figure 4. For all three similarity measures, it is clear that the first 10 values of the spectrum are significantly different from those of the corresponding random matrices.

In Figure 5 we plot the distribution for each of the similarity measures against the distribution for the eigenvalues of the random matrix. Studying both Figures 4 and 5, we observe that as was discussed in the previous sections, for all three similarity measures we find eigenvalues that deviate from noise.


#### 5. Eigenvalue Spectral Entropy

The concept of eigenvalue entropy was used as a measure to quantify the deviation of the eigenvalue distribution from a uniform one [28]. The idea was first used in the context of biological systems [29, 30]. Here we use a similar concept to quantify the spectra rather than the eigenvalue distribution. The spectral entropy, SE, is defined as

$$SE \equiv -\frac{1}{\log N}\sum_{i} \Omega_i \log \Omega_i,$$

where $\Omega_i$ is given by

$$\Omega_i = \frac{\omega_i}{\sum_{j} \omega_j},$$

with $N$ the number of values in the spectrum. Note that the $1/\log N$ normalization was selected to ensure that $SE = 1$ for the maximum entropy limit of flat spectra (all $\omega_i$ are equal).
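A minimal sketch of a normalized Shannon-type spectral entropy is given below. The normalization of the spectrum to unit sum and the use of the logarithm of the number of spectrum values are assumptions of this example, chosen so that a flat spectrum gives a value of 1; the function name is hypothetical.

```python
import numpy as np

def spectral_entropy(omega):
    """Normalized Shannon-type entropy of a non-negative spectrum.
    Assumes Omega_i = omega_i / sum(omega), and 0 * log(0) = 0."""
    omega = np.asarray(omega, dtype=float)
    p = omega / omega.sum()
    nonzero = p[p > 0]
    return float(-(nonzero * np.log(nonzero)).sum() / np.log(omega.size))

flat = spectral_entropy(np.ones(20))              # flat spectrum
peaked = spectral_entropy([1.0, 0.0, 0.0, 0.0])   # one dominant value
print(flat, peaked)
```

A flat spectrum gives the maximum value of 1, and a spectrum concentrated in a single value gives 0, so lower values indicate a larger deviation from the maximum-entropy limit.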

The resulting SE values are presented in Table 3. The lowest entropy, that is, the largest entropy reduction in comparison to that of the corresponding random matrices, and hence the highest value of embedded information, was obtained for the affinity matrices.

However, since the first eigenvalue for the raw and affinity correlations includes the effect of the index (the market), we perform additional comparison after first removing the first eigenvalue for the correlation and affinity matrices. The results are presented in Table 4.

Looking at Table 4, we note that in the case of the S&P, the entropy of the partial correlation matrix is now the lowest. This is consistent with the results presented above, derived from the RMT analysis. However, the DJIA and TA datasets display different behavior of the entropy after the removal of the first eigenvalue. Comparison of the entropy values for DJIA and TA with and without the first eigenvalue provides a quantification of the index effect on these different types of markets.

#### 6. Discussion

Understanding the similarity between stocks for a given portfolio is of the highest importance. Here we present a comparison of three different similarity measures: raw correlations, affinity (collective normalization) correlations, and partial correlations. The eigenvalue statistics and scaling of each of the matrices are compared with those of the corresponding random matrices. The investigation presented here was performed on datasets representing two different markets: the mature and large NYSE, and the young and small TASE.

The affinity matrix, which is the result of a collective normalization of the correlation matrix, has been found to highlight and emphasize cliques and important variables in the system [2, 21, 22]. The most significant difference between the affinity matrices and the raw correlation matrices is that in the affinity case, the first eigenvalue is 33% larger. Thus, the affinity transformation highlights the impact of the market on the stock correlations.

The partial correlation matrices represent the residual correlations after removing the mediating effect of the index [2]. Garas and Argyrakis [11] used RMT to study the eigenvalue distribution of the Athens Stock Exchange and subtracted from it the contribution of the first eigenvalue, by normalizing the eigenvalues of the raw correlations. Using partial correlations, as suggested here, makes it possible to first remove the effect of the index on all the correlations between stocks, and then study the eigenvalue distribution of the matrix.

We found that, in the case of the partial correlations, there are more eigenvalues that deviate from those of the corresponding random matrix (Table 2). It is common to associate the eigenvectors of the deviating eigenvalues with economic sectors (see, e.g., [9]). This has only been partially successful in the past. In the case of the partial correlation matrices, the first eigenvalue is no longer associated with the market mode, since the effect of the market was removed (by eliminating the effect of the index). This led us to the simple intuitive hypothesis that the first eigenvector of the partial correlation matrix would be very similar to the second eigenvector of the raw and normalized correlations. However, this turned out not to be the case, and such similarity was not observed in any of the three datasets. In the S&P dataset, we found that the first eigenvector of the partial correlation matrix mainly, although not exclusively, includes stocks from the technology sector. These findings led us to the conclusion that the association of the eigenvectors of the stock similarity matrices with economic sectors, especially in the case of partial correlations, demands a more rigorous investigation, which will be presented in the future.

To extend our analysis of the eigenvalues of the different similarity measures, we turned to study the full set of eigenvalues of each matrix. Following the notion of Zipf law for text [26, 27], we plotted the values of the ordered eigenvalues as a function of their rank, using a log-log scale. This simple presentation was found to be quite powerful in distinguishing between the different similarity measures. The eigenvalue Zipf plot revealed a power-law scaling behavior with a higher slope for the raw and affinity correlations. This analysis tool was found to be especially useful in the investigation of the DJIA and TA datasets.

Next, we investigated the information content in the eigenvalue spectra (in analogy to the energy level gaps, which were originally addressed by random matrix theory). For this purpose, we used a Shannon entropy-like measure [29, 30] that quantifies the deviations from the limit of maximum entropy (lowest information), which is found for the case of flat spectra. The spectral entropy (SE) revealed that the affinity matrices contain the largest amount of information (minimum spectral entropy). However, when the entropy is computed after removing the first eigenvalue (which corresponds to the market mode) from the raw correlation and affinity matrices, we found that the partial correlations contain more information for the S&P stocks. For the DJIA and TA, the affinity correlations contain the largest amount of information, even after the removal of the first eigenvalue. This phenomenon might indicate a weaker effect of the index in the smaller markets; however, more work is required to clarify this issue. Furthermore, after removing the first eigenvalue, the entropy of the TA stocks is barely changed, unlike that of the DJIA and S&P stocks. This shows that the two markets (NYSE and TASE) have a different dependency on the market mode. However, to fully understand this observation, it is important to study this phenomenon for other markets, of different sizes and maturity stages.

We have used different measures to investigate the information embedded in different stock similarity matrices. On the one hand, we study the eigenvalues of the different similarity matrices, by comparing them to the regime of random noise, and by studying their scaling behavior and the scaling behavior of their frequencies. On the other hand, we investigate the information content using a measure of entropy. In the case of the former, we have found that more information is embedded in the partial correlation matrices, where we have found more eigenvalues that deviate from the regime of the noise. In the case of the latter, we have found that more information is embedded in the affinity matrices.

At this stage it is unclear why the two approaches did not fully agree on the most informative similarity measure. A possible explanation is that RMT analysis focuses on a small subset of the eigenvalues, while the entropy analysis focuses on the full set of eigenvalues, and thus the two forms of information are slightly different. However, while each approach emphasized a different similarity measure as being the most informative, neither found the raw correlations to be the most informative. Thus, we propose the use of the other two similarity measures for the study of stock similarities in a given portfolio.

In conclusion, in this preliminary work we show that important latent market information is embedded in the affinity and partial correlation matrices. This finding has important implications for portfolio optimization and stock risk assessment, and future work should be devoted to these directions. Finally, using these tools to extract meaningful information from stock similarity matrices, we have shown a significant difference between the two types of markets studied here. To fully utilize this result, this work must be expanded to include many other markets. Doing so should elucidate the full picture regarding the difference in information content for different types of markets.

#### Acknowledgments

The authors would like to thank Rosario Mantegna and Michele Tumminello (Palermo University, Italy) for many fruitful conversations and insights on the subject. They have also greatly benefited from discussions with H. Eugene Stanley (Boston University). Finally, they would like to thank Gitit Gur-Gershgoren (Israel Securities Authority) for many fruitful conversations regarding the Tel Aviv Market. This research has been supported in part by the Tauber Family Foundation and the Maguy-Glass Chair in the Physics of Complex Systems at Tel Aviv University.