#### Abstract

This paper builds a model of the evolution of investor behavior based on reinforcement learning in multiplex networks. Because bounded rational investors differ in their learning characteristics when making investment decisions, we model the evolution mechanisms of individual investors and institutional investors separately, drawing on complex network theory and reinforcement learning theory. We then use mathematical analysis and simulation to characterize how investor behavior evolves. The main conclusions are as follows. First, both the intensity of return competition among institutional investors and the forgetting effect influence the evolutionary equilibrium of institutional and individual investors alike. Second, network topology affects the behavioral evolution of individual investors significantly more than that of institutional investors.

#### 1. Introduction

The stock market is highly uncertain and unpredictable because of complex factors, including investors’ bounded rationality and policy correlation effects. Investors tend to trade frequently and blindly follow market volatility, which feeds back into the market and exacerbates its instability [1–4]. Studying the evolution of investor behavior in the stock market is therefore of great significance, as it helps reveal the rules by which market behavior evolves.

In recent years, with the development of big data and the growing demand for financial data analysis, reinforcement learning has become a cutting-edge tool in finance. Many studies have applied reinforcement learning to high-frequency trading, portfolio management, and related fields. Jangmin et al. [5] proposed a stock trading method that combines dynamic asset allocation with reinforcement learning. Shimokawa et al. [6] introduced nonlinear effects into the ordered TD-type learning model and constructed an augmented TD-type learning model to predict investors’ behavior. Bekiros [7] proposed an adaptive fuzzy actor-critic reinforcement learning method and built a financial trading system on it. Tan et al. [8] constructed an arbitrage-free algorithmic trading system using an adaptive network fuzzy inference system (ANFIS) based on reinforcement learning. Bertoluzzo and Corazza [9] applied Q-learning and kernel-based RL algorithms to automated financial trading. Feuerriegel and Prendinger [10] developed an automated decision-making system that uses news-based sentiment data and price momentum, based on supervised and reinforcement learning. Pendharkar and Cusatis [11] constructed a pension investment portfolio based on reinforcement learning. Yang et al. [12], arguing that investor sentiment is an important factor affecting market returns, designed a trading system with an investor-sentiment reward using Gaussian inverse reinforcement learning.

Network science is an important tool for analyzing investor behavior and complements traditional statistics and experimental economics [13–15]. In recent years, many scholars have studied investment behavior in the stock market from a network science perspective. Existing research falls into three strands: analysis of investor network topology, empirical description of investment behavior, and research on investor behavior under interaction mechanisms.

Some scholars have analyzed the topological structure of financial networks using real financial data and found that these networks exhibit disassortative architectures [16], small-world properties [17], power-law degree distributions [18–20], and community structure [21, 22]. Building on these structural studies, others have analyzed investment behavior in financial networks. Long et al. [23] and Baumohl et al. [24] studied volatility spillover effects in the stock market using stock networks. Pareek [25] constructed a network model of mutual fund interconnections, treating holdings of the same stock as the “link,” and revealed the formation mechanism of fund herding behavior. Chung et al. [26] constructed an empirical investor network from the account transaction data of investors on the Taiwan Stock Exchange from 2005 to 2014 and analyzed the impact of information dissemination on stock returns.

Because complete financial data are not always available, theoretical analysis and simulation have become important methods for studying investor behavior in financial networks. Wang et al. [27] divided investors into experts, speculators, and followers to study how the quality of experts’ information affects investor behavior in a social network. Bian et al. [28] constructed an evolutionary model of investor behavior in the stock market based on a network coordination game and studied the impact of investor behavior on the financial market. Bakker et al. [29] and Stefan et al. [30] established evolution models of investment behavior based on investors’ social networks to analyze investor behavior and its impact on the stock market. Xu et al. [31] proposed a weighted game model on an investor network and found that investor behavior evolves more stably in WS small-world networks than in BA scale-free networks. Khashanah and Alsulaiman [32] analyzed the performance of four types of investors on a scale-free network: basic strategy traders, momentum traders, adaptive strategy traders, and zero-intelligence traders. Cohen et al. [33] studied information transmission in the financial market using a social network. Krichene and El-Aroui [34] divided investors into informed and uninformed investors, constructed a scale-free social network model, and analyzed information asymmetry and herd behavior in different stock markets. Wang and Pan [35] built an artificial stock market model with heterogeneous information strategies on a dynamic scale-free network and studied the evolution of investors’ information strategies.

More recently, with the development of network science, many scholars have moved from a single-layer to a multi-layer network perspective to study investor behavior in the stock market. Li et al. [36] analyzed stock market volatility and investor herd behavior using a bipartite network of stocks and traders. Gao [37] explored the impact of margin trading on stock price volatility using a two-layer network of stocks and investors. D’arcangelis et al. [38] analyzed the investment styles of Italian pension funds using a bipartite network model. Paulin et al. [39] constructed a two-layer network of funds and assets and analyzed the causes of the stock market crash. Lu et al. [40] studied the “herd effect” caused by the pursuit of portfolio diversity using a bipartite network of funds and stocks. Biondo et al. [41] built a two-layer financial network model to simulate information dissemination and investor transactions in the market. Wang and Chen [42] constructed a trading model with heterogeneous information on a two-layer social network by introducing a two-factor interaction function. Souza and Aste [43] established a multi-layer network model from social media data and financial data to predict the future structure of the stock market.

In summary, although many studies have examined investor behavior in the stock market using reinforcement learning theory, most rely on a single-layer network model and a single reinforcement learning mechanism, ignoring heterogeneity in both investor network structure and investor learning. This paper therefore divides stock market investors into individual investors and institutional investors. We analyze the decision-making mechanisms of each type using complex network theory and reinforcement learning theory and, on this basis, construct an evolution model of investor behavior in a multi-layer network. Finally, we investigate the evolution characteristics of investor behavior through mean-field analysis and simulation.

#### 2. The Model

##### 2.1. Evolutionary Mechanism of Investor Behavior

Because of information asymmetry, differences in professional expertise, and differences in investment goals [44, 45], bounded rational investors are unlikely to make optimal decisions by observing current market conditions alone. Experience is the simplest way to judge whether an investment decision is sound: before making the next decision, an investor usually refers to past returns and, even if the environment has changed, repeats the strategy that earned better returns in the previous period [46, 47]. Experience-based reinforcement learning therefore directly shapes investors’ strategies. Individual and institutional investors, however, differ markedly. For institutional investors, better performance attracts more capital inflows, which helps them achieve their goals, so performance competition among institutional investors is intense [48]. This paper therefore divides investors into institutional and individual investors and constructs a network model of how investment behavior evolves under different reinforcement learning strategies, as shown in Figure 1.

As the above analysis shows, the decision-making mechanism plays a crucial role in the evolution of investor behavior. Given the differences between individual and institutional investors in investment decision-making, we construct separate learning strategies for the behavioral evolution of each type.

##### 2.2. Evolutionary Model of Investor Behavior

Represent the stock market as a network in which each node is an investor and each edge is a connection between two investors; each node has a degree, and the network has a degree distribution. Investors are divided into institutional investors and individual investors: institutional investors make up a fixed proportion of all investors, and individual investors make up the remaining proportion. We further assume that each investor is in one of three investment behavior states: “sell,” “hold,” or “buy.” The state of all investors in the stock market can then be written as the collection of their individual states at any time.

For a given investor, count the individual investors and the institutional investors connected to it; together these two counts equal the investor’s degree. The institutional investors connected to that investor are split into buyers, sellers, and holders, whose counts sum to the number of institutional neighbors. Similarly, the individual investors connected to that investor are split into buyers, sellers, and holders, whose counts sum to the number of individual neighbors.
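The neighbor bookkeeping above can be made concrete with a short Python sketch. The adjacency-list representation, the string state labels, and the function name `neighbor_counts` are illustrative assumptions, not the paper's code; the function tallies buyers, sellers, and holders separately among an investor's institutional and individual neighbors, so the two totals add up to the investor's degree.

```python
from collections import Counter

# Hypothetical helper for illustration; not taken from the paper.
def neighbor_counts(investor, adjacency, is_institutional, state):
    """Tally neighbor states of `investor`, split by neighbor type.

    adjacency: dict node -> iterable of neighbor nodes
    is_institutional: dict node -> bool
    state: dict node -> "buy" | "sell" | "hold" (assumed labels)
    Returns (institutional_counts, individual_counts) as Counters whose
    totals add up to the investor's degree.
    """
    institutional = Counter()
    individual = Counter()
    for neighbor in adjacency[investor]:
        bucket = institutional if is_institutional[neighbor] else individual
        bucket[state[neighbor]] += 1
    return institutional, individual
```

Because `Counter` returns 0 for missing keys, a state with no matching neighbors reads as a zero count.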

###### 2.2.1. Evolution Rules of Individual Investors Behavior

① Individual investor reinforcement learning strategies: Strahilevitz et al. [49] found that investors tend to repurchase stocks they previously sold for a gain while ignoring stocks they previously sold for a loss. Linnainmaa [50] found that household trading intensity depends on past performance. In the stock market, the investment performance of individual investors is largely shaped by personal experience. In particular, investors tend to pay more attention to successful experiences, even when those successes cannot be replicated in the future [51, 52]. Reinforcement learning models can explain these findings. Individual investors adjust their behavior as the stock market changes: on perceiving a change, they take an action such as buying, selling, or holding, and the market provides feedback on that action. If the action succeeds, the probability of selecting it again increases [53]. Kaustia and Knüpfer [46] found that reinforcement learning explains the positive correlation between investors’ past IPO returns and their future subscriptions better than Bayesian learning does. Choi et al. [54] used reinforcement learning to explain the savings behavior of individual investors. Erev and Roth [55] established a reinforcement learning model that describes the evolution of individual behavior in economic experiments well. This paper therefore adopts Erev and Roth’s reinforcement learning model to describe how individual investors’ investment behavior evolves. Each strategy has an attractiveness to each investor at each time step. A forgetting parameter, restricted to a fixed range, gradually weakens the influence of past experience. Each investor also receives a return at each time step, which takes one value when the investor’s decision matches the optimal decision and another value otherwise. Investors update the attractiveness of their strategies according to these rules.

② Individual investment decisions: based on the above analyses, each individual investor chooses to “buy,” “sell,” or “hold” with a probability determined by the attractiveness of the corresponding strategy.
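The exact update and payoff equations are not reproduced in the text, so the following Python sketch uses the commonly cited form of Erev and Roth's reinforcement learning rule as an assumption: every propensity decays by a factor `(1 - phi)`, the chosen action is reinforced by the realized payoff (with an optional experimentation share `epsilon` spread over the other actions), and actions are chosen with probability proportional to their propensities. The payoff of 1 for matching a hypothetical `optimal` action and 0 otherwise is an assumed stand-in for the paper's return rule, and all function names are hypothetical.

```python
import random

ACTIONS = ("buy", "sell", "hold")

# Hypothetical sketch of an Erev-Roth style learner; not the paper's code.
def erev_roth_update(propensity, chosen, reward, phi, epsilon=0.0):
    """Return updated propensities after one round.

    Every propensity decays by (1 - phi); the chosen action receives
    reward * (1 - epsilon) and each other action reward * epsilon / (M - 1).
    With epsilon = 0 only the chosen action is reinforced.
    """
    m = len(propensity)
    updated = {}
    for action, q in propensity.items():
        if action == chosen:
            share = reward * (1.0 - epsilon)
        else:
            share = reward * epsilon / (m - 1)
        updated[action] = (1.0 - phi) * q + share
    return updated

def choice_probabilities(propensity):
    """Choice probability of each action is proportional to its propensity."""
    total = sum(propensity.values())
    return {action: q / total for action, q in propensity.items()}

def play_round(propensity, optimal, phi, rng):
    """Sample an action, pay 1 if it matches `optimal` (assumed rule), update."""
    probs = choice_probabilities(propensity)
    chosen = rng.choices(list(probs), weights=list(probs.values()))[0]
    reward = 1.0 if chosen == optimal else 0.0
    return erev_roth_update(propensity, chosen, reward, phi), chosen
```

Starting from equal propensities such as `{a: 1.0 for a in ACTIONS}` and calling `play_round` repeatedly with `optimal="buy"` concentrates probability on "buy"; here `phi` plays the role of the forgetting parameter, so a larger value discounts past reinforcement faster.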

###### 2.2.2. Evolution Rules of Institutional Investors Behavior

Greenwood and Nagel [56] analyzed the performance of experienced and inexperienced fund managers during the bubble and found that inexperienced managers bought technology stocks heavily during the run-up and sold them during the downturn. Like individual investors, then, institutional investors also rely on reinforcement learning when trading. The historical performance of institutional investors influences the capital flows and investment choices of individual investors, so institutional investors compete with one another to maximize profits [48]. Unlike individual investors, institutional investors therefore exhibit not only reinforcement learning in their decision-making but also game behavior toward one another, namely, belief learning (game learning). The EWA learning model [57] incorporates both reinforcement learning and belief learning, so we build the evolution model of institutional investor behavior on it.

① Institutional investor reinforcement learning strategies: following the EWA learning model, each strategy has an attractiveness to an institutional investor at each time step when the influence of other investors’ decisions is ignored. Each strategy also yields an actual return when the institutional investor chooses it. A forgetting-effect coefficient, restricted to a fixed range, governs how quickly past experience fades: the larger the coefficient, the smaller the influence of past experience. Institutional investors update their strategies through reinforcement learning according to this rule.

② Institutional investment decision: for each institutional investor, consider the strategy it chooses at a given time and the strategy chosen by each of its neighbors; an indicator function equals 1 when the two strategies are the same and 0 otherwise. The strategies adopted at that time by all other investors adjacent to the given investor form a strategy vector, and, given this vector, each strategy the investor might choose yields a corresponding return. A separate parameter, also restricted to a fixed range, represents the intensity of return competition among institutional investors. Based on these quantities, each institutional investor has a decision function that determines its choice.

From the decision function, we obtain the probabilities that an institutional investor chooses to “buy,” “sell,” or “hold.”
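The EWA rule itself is not reproduced in the text, and the way the paper's competition-intensity parameter enters it cannot be recovered here. The Python sketch below therefore shows only the standard EWA update of Camerer and Ho, followed by a logit choice rule, to illustrate how one model mixes reinforcement learning and belief learning. In this standard form, `delta` weights the foregone payoffs of unchosen strategies (the belief-learning part), the indicator term gives the chosen strategy full weight (the reinforcement part), `phi` and `rho` discount past attractions and past experience, and `lam` is the logit sensitivity. Note that `phi` here measures retained memory, so the paper's forgetting coefficient would correspond to something like `1 - phi`; that mapping, like all names below, is an assumption.

```python
import math

# Standard EWA update (Camerer and Ho); a sketch, not the paper's exact rule.
def ewa_update(attractions, experience, chosen, payoffs, phi, delta, rho):
    """One step of the standard EWA rule.

    attractions: dict strategy -> A_j(t-1)
    experience:  N(t-1), the experience weight
    chosen:      strategy actually played at t
    payoffs:     dict strategy -> payoff that strategy earns (or would have
                 earned) against the neighbors' realized strategies
    Returns (new_attractions, N(t)).
    """
    new_experience = rho * experience + 1.0
    new_attractions = {}
    for s, a in attractions.items():
        indicator = 1.0 if s == chosen else 0.0
        weight = delta + (1.0 - delta) * indicator
        new_attractions[s] = (phi * experience * a + weight * payoffs[s]) / new_experience
    return new_attractions, new_experience

def logit_probabilities(attractions, lam):
    """Logit choice probabilities; subtracting the max keeps exp() stable."""
    top = max(attractions.values())
    weights = {s: math.exp(lam * (a - top)) for s, a in attractions.items()}
    z = sum(weights.values())
    return {s: w / z for s, w in weights.items()}
```

Setting `delta = 0` removes foregone-payoff updating, so only the chosen strategy is reinforced and the rule reduces to a pure reinforcement learner.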

#### 3. Mathematical Analysis

For institutional investors of a given degree at a given time, consider the proportions that decide to buy, to sell, and to hold; these three proportions sum to 1. The corresponding proportions for individual investors of a given degree are defined in the same way and also sum to 1.
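To make the degree-resolved proportions concrete, the sketch below computes, for one investor type, the fraction of investors of each degree who are buying, selling, and holding in a simulated snapshot; by construction the three fractions for each degree sum to 1. The dictionary inputs, state labels, and the name `degree_resolved_fractions` are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical helper for illustration; not taken from the paper.
def degree_resolved_fractions(degree, state):
    """Fractions of buy/sell/hold among investors of each degree.

    degree: dict node -> degree; state: dict node -> "buy" | "sell" | "hold".
    Returns {k: {"buy": ..., "sell": ..., "hold": ...}}.
    """
    counts = defaultdict(lambda: {"buy": 0, "sell": 0, "hold": 0})
    for node, k in degree.items():
        counts[k][state[node]] += 1
    fractions = {}
    for k, c in counts.items():
        n = sum(c.values())
        fractions[k] = {s: v / n for s, v in c.items()}
    return fractions
```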

##### 3.1. Mathematical Analysis on the Evolution of Institutional Investor Behavior

Consider the probability that a link of a given investor leads to an institutional investor whose decision is to buy, together with the corresponding probabilities for institutional investors whose decisions are to sell and to hold. These three probabilities determine the probability that exactly a given number of buyers and a given number of sellers appear among the institutional investors connected to the investor.
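In heterogeneous mean-field analyses of this kind, a common assumption is that a randomly followed link reaches a node of degree $k$ with probability proportional to $kP(k)$ and that neighbors' states are independent. Under those assumptions, which may differ from the paper's exact formulation, the link probabilities and the neighbor-configuration probability take the following form; the symbols $\rho_b(k,t)$, $\rho_s(k,t)$, $\Theta$, $n_b$, and $n_s$ are introduced here for illustration only.

$$
\begin{aligned}
\Theta_b(t) &= \frac{1}{\langle k\rangle}\sum_k k\,P(k)\,\rho_b(k,t), \qquad
\Theta_s(t) = \frac{1}{\langle k\rangle}\sum_k k\,P(k)\,\rho_s(k,t), \qquad
\Theta_h(t) = 1-\Theta_b(t)-\Theta_s(t),\\
\Pr(n_b, n_s \mid k) &= \frac{k!}{n_b!\,n_s!\,(k-n_b-n_s)!}\,
\Theta_b^{\,n_b}\,\Theta_s^{\,n_s}\,\Theta_h^{\,k-n_b-n_s}.
\end{aligned}
$$

The expression for $\Theta_h$ follows because the buy, sell, and hold proportions sum to 1 for every degree $k$.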

Summing over all such neighbor configurations gives the probabilities that an institutional investor chooses to buy, to sell, and to hold, and from these the rate of change in the proportion of institutional investors who buy.

According to the mean-field equation, the rate of change in the proportion of institutional investors who buy equals the probability that a non-buyer becomes a buyer minus the probability that a buyer becomes a non-buyer. When this rate equals 0, the network reaches a steady state, which yields the equilibrium condition for institutional investor behavior.
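The steady-state argument can be illustrated numerically. In the Python sketch below, the buyer fraction `x` evolves as `(1 - x) * w_in(x) - x * w_out(x)`, that is, non-buyers becoming buyers minus buyers becoming non-buyers, and forward-Euler integration stops once the rate is essentially zero. The transition-rate functions are hypothetical placeholders supplied by the caller, not the paper's expressions, which would come from summing over neighbor configurations.

```python
# Hypothetical illustration of a mean-field steady state; not the paper's code.
def mean_field_rate(x, w_in, w_out):
    """dx/dt for the buyer fraction x: inflow of non-buyers minus outflow of buyers."""
    return (1.0 - x) * w_in(x) - x * w_out(x)

def integrate_to_steady_state(x0, w_in, w_out, dt=0.01, tol=1e-10, max_steps=1_000_000):
    """Forward-Euler integration until |dx/dt| < tol; x is clipped to [0, 1]."""
    x = x0
    for _ in range(max_steps):
        rate = mean_field_rate(x, w_in, w_out)
        if abs(rate) < tol:
            return x
        x = min(1.0, max(0.0, x + dt * rate))
    raise RuntimeError("no steady state reached within max_steps")
```

For example, with the hypothetical rates `w_in = lambda x: 0.1 + 0.3 * x` and `w_out = lambda x: 0.2`, setting the rate to zero gives 0.1 - 0.3x² = 0, so integration from `x0 = 0.0` (everyone initially holding) settles near x = √(1/3) ≈ 0.577.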

Combining equation (10) with the condition that the proportions of buying, selling, and holding sum to 1 determines the equilibrium state of institutional investor behavior.

However, these results do not directly reveal how key elements such as network topology and learning strategies affect the equilibrium of institutional investor behavior, so further discussion is needed.

According to , it can be seen that investors will choose to buy when and , so

In equation (12), is not affected by , so it mainly considers the influence of on . The values of do not change when and , but the change is uncertain when and . Based on this, it is assumed that , , and investors’ decision-making is not affected by when and , so analysis is not required. Therefore, with the increase of , is not strictly decreasing; that is, is not strictly decreasing. However, it is difficult to obtain the influence of the forgetting effect on the stable state of institutional investor behavior through mathematical analysis, so this paper analyzes it by simulation.

##### 3.2. Mathematical Analysis on the Evolution of Individual Investor Behavior

Define, analogously, the probabilities that a link leads to an individual investor who is buying, selling, or holding. As for institutional investors, we obtain the probability that exactly a given number of buyers and sellers appear among the institutional investors connected to an investor and, similarly, the probability that exactly a given number of buyers and sellers appear among the individual investors connected to that investor. From these, we obtain the probability that both events occur at the same time.

Summing over all such joint neighbor configurations gives the probabilities that an individual investor chooses to buy, to sell, and to hold, and from these the rate of change in the proportion of individual investors who buy.
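The joint summation over institutional and individual neighbor configurations can be sketched in Python as follows. The sketch assumes, as is common in mean-field treatments, that the two neighbor groups are independent, so the joint probability is the product of two multinomial probabilities; `p_buy_given`, which maps a neighbor configuration to the probability of buying, is a hypothetical placeholder for the paper's decision rule.

```python
from math import factorial

# Hypothetical illustration; the independence assumption is ours, not the paper's.
def multinomial_pmf(nb, ns, k, tb, ts):
    """P(exactly nb buyers and ns sellers among k neighbors), each neighbor
    independently a buyer w.p. tb, a seller w.p. ts, a holder otherwise."""
    nh = k - nb - ns
    th = 1.0 - tb - ts
    coef = factorial(k) // (factorial(nb) * factorial(ns) * factorial(nh))
    return coef * tb ** nb * ts ** ns * th ** nh

def configurations(k):
    """All (buyers, sellers) pairs possible among k neighbors."""
    for nb in range(k + 1):
        for ns in range(k - nb + 1):
            yield nb, ns

def expected_buy_probability(k_inst, k_indiv, theta_inst, theta_indiv, p_buy_given):
    """Average a conditional buy probability over joint neighbor configurations.

    theta_inst / theta_indiv: (buy, sell) link probabilities for each group.
    p_buy_given(nb_inst, ns_inst, nb_indiv, ns_indiv): hypothetical rule.
    """
    total = 0.0
    for nb1, ns1 in configurations(k_inst):
        w1 = multinomial_pmf(nb1, ns1, k_inst, *theta_inst)
        for nb2, ns2 in configurations(k_indiv):
            w2 = multinomial_pmf(nb2, ns2, k_indiv, *theta_indiv)
            total += w1 * w2 * p_buy_given(nb1, ns1, nb2, ns2)
    return total
```

As a sanity check, a rule that returns 1 for every configuration yields a total of 1, and a rule equal to the overall fraction of buying neighbors yields the degree-weighted average of the two groups' buying probabilities.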

According to the mean-field equation, the rate of change in the proportion of individual investors who buy equals the probability that a non-buyer becomes a buyer minus the probability that a buyer becomes a non-buyer. When this rate equals 0, the network reaches a steady state, which yields the equilibrium condition for individual investor behavior.

Combining equation (19) with the condition that the proportions of buying, selling, and holding sum to 1 determines the equilibrium state of individual investor behavior.

However, the influence of the learning strategy, the forgetting effect, the initial state, and other factors on the equilibrium of individual investor behavior cannot be obtained analytically from these expressions, so we analyze it by simulation.

#### 4. Simulation Results

The preceding sections analyzed theoretically how investor behavior in the stock market evolves under reinforcement learning. The analysis shows that this evolution is mainly affected by the network structure, the intensity of return competition among institutional investors, and the forgetting parameters. This section uses simulation to further explain the evolution process. According to the statistical yearbooks of the Shanghai Stock Exchange for the past five years, natural-person investors accounted for 80% to 85% of transactions, while general legal persons, Shanghai Stock Connect, and professional institutions accounted for 15% to 20%. Accordingly, we set the proportion of institutional investors in the network to 0.2: with 500 investors in total, there are 400 individual investors and 100 institutional investors. Existing research shows that stock investor networks built from financial correlations exhibit power-law degree distributions [18–20], so we construct a two-layer scale-free network whose layers have different average degrees. Edges between the two types of investors are established by either degree-dependent connection or random connection. Initially, all investors are in the “holding” state, and each simulation runs for 300 time steps. All remaining parameters, including the average degree of the entire network, are set to fixed default values. We assume that the market is generally in a good state.
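The network set-up described above can be sketched with the Python standard library. The layer sizes (400 individual and 100 institutional investors) follow the text; the attachment parameters `m_indiv` and `m_inst`, the number of inter-layer edges `n_inter`, and the choice to weight degree-dependent connections by within-layer degree are assumptions, because the average degrees used in the paper are not recoverable here. Each BA layer is grown by preferential attachment from a small clique, and inter-layer edges are added either uniformly at random or with endpoints drawn in proportion to degree.

```python
import random

# Hypothetical construction for illustration; parameters are assumptions.
def barabasi_albert(n, m, rng):
    """Grow an undirected BA graph (adjacency sets) by preferential attachment.

    Starts from a clique of m + 1 nodes; each later node links to m distinct
    existing nodes drawn with probability proportional to their degree.
    """
    adj = {v: set() for v in range(n)}
    for i in range(m + 1):
        for j in range(i):
            adj[i].add(j)
            adj[j].add(i)
    # Each node appears in the pool once per incident edge, so uniform
    # sampling from the pool is degree-proportional sampling.
    pool = [v for v in range(m + 1) for _ in range(m)]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(pool))
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
            pool.extend((new, t))
    return adj

def build_two_layer_network(n_indiv=400, n_inst=100, m_indiv=2, m_inst=2,
                            n_inter=400, degree_dependent=False, seed=0):
    """Individual layer = nodes 0..n_indiv-1; institutional layer follows.

    Inter-layer edges join one individual and one institutional investor,
    chosen uniformly (random connection) or in proportion to within-layer
    degree (degree-dependent connection). n_inter must not exceed
    n_indiv * n_inst.
    """
    rng = random.Random(seed)
    indiv = barabasi_albert(n_indiv, m_indiv, rng)
    inst = barabasi_albert(n_inst, m_inst, rng)
    adj = {v: set(nbrs) for v, nbrs in indiv.items()}
    for v, nbrs in inst.items():
        adj[v + n_indiv] = {u + n_indiv for u in nbrs}
    indiv_nodes = list(range(n_indiv))
    inst_nodes = list(range(n_indiv, n_indiv + n_inst))
    w_indiv = [len(adj[v]) for v in indiv_nodes] if degree_dependent else None
    w_inst = [len(adj[v]) for v in inst_nodes] if degree_dependent else None
    added = 0
    while added < n_inter:
        a = rng.choices(indiv_nodes, weights=w_indiv)[0]
        b = rng.choices(inst_nodes, weights=w_inst)[0]
        if b not in adj[a]:  # skip duplicate inter-layer edges
            adj[a].add(b)
            adj[b].add(a)
            added += 1
    return adj
```

With the same `seed`, `build_two_layer_network(degree_dependent=False)` and `build_two_layer_network(degree_dependent=True)` produce identical layers and differ only in where the inter-layer edges land, which mirrors the two connection schemes compared in the simulations.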

##### 4.1. Impact of Network Structure on the Evolution of Investor Behavior in Stock Market

Figure 2 compares the dynamic evolution of the two types of investor behavior in networks with random and with degree-dependent inter-layer connection, for two values of the network’s average degree.

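**Figure 2** (a)–(d): Dynamic evolution of individual and institutional investor behavior under random and degree-dependent inter-layer connection at two average degrees.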

Figure 2 shows how network connectivity and network heterogeneity affect the evolution of investor behavior. For individual investors, Figures 2(a) and 2(b) show that, in BA-BA multi-layer networks with random inter-layer connection, behavior reaches a steady state sooner at the higher average degree than at the lower one; that is, the time needed to reach a steady state shortens as the average degree increases. Figures 2(c) and 2(d) show the corresponding times in BA-BA multi-layer networks with degree-dependent inter-layer connection, which differ from those under random connection. Network connectivity and network heterogeneity therefore both affect the evolution of individual investor behavior.
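The text does not state how the time to reach a steady state is measured. One hypothetical criterion, sketched below in Python, takes the mean of the last `window` values of a state-proportion series as the steady level and reports the first time step after which the series stays within `tol` of that level; both the criterion and its default parameters are assumptions.

```python
# Hypothetical convergence criterion for illustration; not the paper's definition.
def time_to_steady_state(series, window=20, tol=0.01):
    """First index after which `series` stays within `tol` of its final level.

    The final level is the mean of the last `window` values. Returns None if
    even the last value lies outside the tolerance band.
    """
    if len(series) < window:
        raise ValueError("series is shorter than the averaging window")
    level = sum(series[-window:]) / window
    t_steady = None
    for t in range(len(series) - 1, -1, -1):
        if abs(series[t] - level) > tol:
            break
        t_steady = t
    return t_steady
```

Applying the same criterion to the “buy” proportion series of every run keeps convergence times comparable across network structures and average degrees.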

For institutional investors, behavior quickly reaches a steady state in every case, with essentially the same convergence time and final state proportions. Network topology thus significantly affects the behavioral evolution of individual investors but has a weaker impact on institutional investors.

##### 4.2. Impact of the Intensity of Returns Competition among Institutional Investors on the Evolution of Stock Market Investor Behavior

Holding all other parameters fixed, Figure 3 shows the evolution of institutional and individual investors’ investment behavior under different intensities of return competition among institutional investors.

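**Figure 3** (a), (b): Evolution of investor behavior under different intensities of return competition among institutional investors in the two network structures.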

As Figure 3 shows, under both network structures, the intensity of return competition among institutional investors affects the investment behavior of both institutional and individual investors. For institutional investors, as competition intensity increases, the proportion “buying” keeps decreasing and the proportion “holding” keeps increasing, while the proportion “selling” changes little. For individual investors, as competition intensity increases, the proportions “holding” and “selling” increase slightly and the proportion “buying” keeps decreasing.

##### 4.3. Impact of Forgetting Effect on the Evolution of Investor Behavior in Stock Market

This section analyzes the impact of the forgetting effect coefficient on the dynamic evolution of investor behavior in each of the two network structures, with the other parameters at their default values. The simulation results are shown in Figure 4.

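**Figure 4** (a), (b): Evolution of investor behavior under different forgetting effect coefficients in the two network structures.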

As Figure 4 shows, under both network structures, as the forgetting coefficient increases, the proportion of individual investors “buying” decreases continuously, while the proportions “holding” and “selling” increase slightly. The proportion of individual investors “buying” is higher in the network with degree-dependent inter-layer connection than in the network with random inter-layer connection. Institutional investors, by contrast, are less affected by the forgetting coefficient: as it increases, the proportion of institutional investors “buying” decreases only slightly. In summary, as the forgetting effect coefficient increases, the proportion of investors who “buy” at equilibrium falls, and it falls more for individual investors than for institutional investors; that is, the forgetting effect has a greater impact on individual investor behavior.

##### 4.4. Impact of Initial Investment Behavior State on the Evolution of Investor Behavior in Stock Market

This section discusses how investors’ initial investment behavior affects the dynamic evolution of their behavior. The simulation parameters take the default values given above.

Figures 5(a)–5(d) show the effect of investors’ initial behavioral state on the dynamic evolution of their behavior in the two types of network structure at two values of average degree. As the initial proportion of investors in the “buy” state increases, the final equilibrium states of both institutional and individual investor behavior remain essentially unchanged. Thus, for both institutional and individual investors, the initial investment behavior state has little effect on behavioral evolution.

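**Figure 5** (a)–(d): Evolution of investor behavior under different initial proportions of “buy” states in the two network structures at two average degrees.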

#### 5. Conclusion

In this paper, we analyze the decision-making mechanisms of individual and institutional investors using complex network theory and reinforcement learning theory and construct a reinforcement-learning-based evolution model of investor behavior on a network. We then investigate the evolution characteristics of investor behavior through mean-field analysis and simulation. The conclusions are as follows. (1) Network topology has a greater influence on the evolution of individual investor behavior than on that of institutional investor behavior. (2) The equilibrium state of institutional investor behavior is nonlinearly related to the intensity of return competition among institutional investors, and moderate competition for returns among institutional investors is conducive to stock market stability. (3) The forgetting effect significantly influences the evolutionary equilibrium of both institutional and individual investors. Through reinforcement learning, investors can extract useful investment guidance from large amounts of market information, decide which information deserves more attention, and to some extent eliminate the influence of behavioral biases on investment. At the same time, however, successful experiences also reinforce irrational traits such as overconfidence. In a bull market, investors often attribute their success to their own investment ability, which makes them more overconfident and leads to further irrational behavior. Investors should therefore view their own experience as rationally as possible when making investment decisions and commit to long-term investment. (4) The initial state of investor behavior has little influence on the evolutionary equilibrium of either institutional or individual investors.

This paper studies the evolution of investor behavior on a static network structure. In practice, however, the network structure evolves as investors’ behavior changes. Analyzing the evolution of investor behavior on dynamic network structures is therefore left for future work.

#### Data Availability

The data used to support the findings of this study are included within the article.

#### Conflicts of Interest

The authors declare no conflicts of interest.

#### Acknowledgments

This work was supported by the National Planning Office of Philosophy and Social Science (Grant no. 19BJY255), project supported by MOE (Ministry of Education in China) Youth Foundation Project of Humanities and Social Sciences (Grant no. 17YJC790002), Sponsorship of Jiangsu Overseas Visiting Scholar Program for University Prominent Young & Middle-Aged Teachers and Presidents in the year of 2018, and Teaching Reform Project of Nanjing Normal University in the year of 2019 (Grant no. 2019NSDJG025).

#### Supplementary Materials

The supplementary material contains the source code for the manuscript.