Abstract

Identifying illegal insider trading is of great significance to the healthy development of the securities market. However, with the development of information technology, problems such as multiple data sources and noise pose challenges to insider trading identification. Moreover, most current research on insider trading identification is based on single-task learning, which treats enterprises in different industries as a whole and may therefore ignore the differences between insider trading identification in different industries. In this article, we collect indicators from multiple sources to help regulators identify insider trading and then use information gain and correlation analysis to screen the indicators. Finally, we propose a multitask deep neural network with insider trading identification in different industries as different subtasks. The proposed model takes into account the correlations and differences between tasks. Experimental results show that, compared with logistic regression, support vector machine, deep neural network, random forest, and extreme gradient boosting models, the proposed model identifies insider trading of enterprises in different industries more accurately and efficiently. This article provides new ideas for market regulators to maintain the order of the securities market through intelligent means.

1. Introduction

The securities market is an essential component of the financial market and has contributed significantly to the development of the world economy. However, due to asymmetric information, inadequate corporate governance structures, and poor regulatory mechanisms, illegal insider trading occurs frequently [1, 2]. Insider trading not only seriously undermines the fairness of securities trading, with a huge impact on stock price volatility, but also distorts the responsiveness of market prices to asset values and reduces the efficiency of asset allocation in the securities market [3, 4].

Governments have never ceased to combat and regulate insider trading, but the regulatory approach is mainly ex-post, with a significant lag [5]. How to strengthen ex-ante prevention and stop enterprises from insider trading in a timely manner has become an important issue in maintaining market order. Artificial intelligence technology should be fully used to select reasonable regulatory indicators and establish scientific regulatory models that enhance the identification of insider trading.

Scholars have selected a variety of indicators to assist in the identification of insider trading. In general, stocks with insider trading have significantly nonzero excess returns [6]. The occurrence of insider trading increases the bid-ask spread of stocks and reduces the liquidity of the securities market [7]. Therefore, stock performance helps scholars identify the presence of insider trading. Some scholars have studied insider trading in combination with corporate governance structures and found that the probability of insider trading is related to the size of shareholders’ rights: the greater the shareholders’ rights, the greater the probability of insider trading [8]. Therefore, banning insider trading helps maximize the value of the company and the interests of small and medium-sized shareholders [9]. In addition to shareholding rights, executive compensation is also closely related to the occurrence of insider trading [10]. Companies with good governance can effectively limit the use of inside information by insiders, thereby constraining the occurrence of insider trading [11]. In addition, the company’s financial indices can help scholars determine whether insider trading has occurred [12]. Song and Li [13] selected 59 indicators to identify insider trading in terms of corporate governance, stock performance, and financial indices and found that seven indicators, such as the extraordinary cumulative return, were more effective. In addition to the above factors, the media’s attention to listed companies can play an external monitoring role, reducing the occurrence of irregular trading behaviors [14]. With the development of information and communication technologies, a large number of text documents are available on the web, such as media news reports about companies, which can be used as an important source of information for decision making [15]. We can use sentiment analysis to extract structured and informative knowledge from unstructured text and to classify text documents as positive, negative, or neutral [16]. Therefore, news data can provide valuable information to decision makers for insider trading identification.

In early studies on insider trading identification models, scholars mostly used the event study approach, that is, identifying the existence of insider trading by observing whether the market reacts to information in advance [17]. Some academics have also used models such as logistic and probit regression to identify insider trading and have found these models to be more effective than the event study method [18]. With the development of artificial intelligence, machine learning algorithms such as random forests, fuzzy-rough nearest neighbor algorithms, and neural networks are widely used to build robust and effective discrimination and classification systems in economics, physics, medicine, and computer science [19, 20]. With the help of machine learning, experts can identify insider trading more accurately. Methods such as decision trees, support vector machines, and neural networks have gradually been applied to insider trading identification [21, 22]. Zhang [23] used a support vector machine to identify insider trading, compared it with the traditional logistic method, and found that the support vector machine model predicted better. Deng et al. [21] proposed an identification method that integrates XGBoost and NSGA-II for insider trading regulation.

An ex-ante identification model of insider trading based on indicators such as stock performance, corporate governance, and financial indices can track suspicious securities accounts promptly, but selecting from high-dimensional indicators is quite challenging. In addition, most studies on insider trading identification are based on single-task learning and treat enterprises in different industries as a whole. However, industries differ in their degree of development, business management activities, accounting, and financial management, and these differences may affect the insider trading activities of enterprises in different industries in different ways. If the data sets of different industries are modeled as a whole, the differences among insider trading of enterprises in different industries may be ignored; if only the data set of a single industry is modeled, the correlations between industries may be ignored. Single-task learning ignores possible relationships between different tasks, whereas multitask learning can take into account correlations and differences between tasks and improve the generalization ability of the model by exploiting the domain-specific information contained in the related tasks [24, 25]. In addition, neural networks have certain advantages in dealing with nonlinear problems [26], so we propose a multitask deep neural network model for insider trading identification.

The contributions of this research are as follows. First, we collect data on stock performance, corporate governance, financial indices, and media coverage of enterprises using web scraping and text mining techniques to enhance the model’s ability to identify insider trading. Second, for the high-dimensional indicators, we conduct feature selection based on information gain and correlation analysis. Third, taking the identification of insider trading in different industries as different subtasks, we propose a multitask deep neural network model to identify insider trading and comprehensively compare it with logistic regression, support vector machine (SVM), deep neural network (DNN), random forest (RF), and extreme gradient boosting (XGBoost) models.

2. Methods

2.1. Indicators Screening Based on Information Gain and Correlation Analysis

Information gain performs well in indicator variable selection [27]. To improve the operational efficiency and performance of the model, for the high-dimensional indicators, the information gain method is first used to screen out the indicators with strong discriminatory ability, and correlation analysis is then performed to further optimize the indicator set. The specific steps are as follows:

Step 1: Calculating the information gain for indicator $i$. The information gain of indicator $i$ is denoted $\mathrm{Gain}(i)$; it is calculated from information entropy and conditional entropy. The higher the information gain value of the indicator, the more classification information it carries, indicating that the indicator is more capable of distinguishing companies with insider trading.

Step 2: Calculating the percentage of information gain for indicator $i$. Let $P_i$ be the percentage of information gain for indicator $i$ and $n$ be the number of indicators; then

$$P_i = \frac{\mathrm{Gain}(i)}{\sum_{j=1}^{n} \mathrm{Gain}(j)} \times 100\% \tag{1}$$

Step 3: Calculating the cumulative percentage of information gain. The information gain percentages are sorted from largest to smallest, denoted $P_{(1)}, P_{(2)}, \ldots, P_{(n)}$. Let $C_p$ be the cumulative percentage of information gain for the first $p$ indicators; then we have

$$C_p = \sum_{i=1}^{p} P_{(i)} \tag{2}$$

The accumulation stops when $C_p$ reaches the threshold $K$, and the corresponding $p$ indicators are retained. In this research, the threshold $K$ is set to 60%, 70%, and 80% for comparative analysis in subsequent experiments.

Step 4: Performing correlation analysis for the retained indicators. Among the pairs of indicators with high correlation, the indicator with the smaller information gain value is deleted to avoid duplication of the information reflected by the indicators. Let $r_{kj}$ denote the correlation coefficient of the $k$th indicator with the $j$th indicator; $x_{ik}$ denote the standardized data of the $k$th indicator of the $i$th stock; $\bar{x}_k$ denote the mean of the standardized data of the $k$th indicator; $x_{ij}$ denote the standardized data of the $j$th indicator of the $i$th stock; $\bar{x}_j$ denote the mean of the standardized data of the $j$th indicator; and $m$ denote the number of stocks. Then,

$$r_{kj} = \frac{\sum_{i=1}^{m} (x_{ik} - \bar{x}_k)(x_{ij} - \bar{x}_j)}{\sqrt{\sum_{i=1}^{m} (x_{ik} - \bar{x}_k)^2 \sum_{i=1}^{m} (x_{ij} - \bar{x}_j)^2}} \tag{3}$$

A larger value of $|r_{kj}|$ indicates a greater correlation between the $k$th indicator and the $j$th indicator.

Step 5: The final set of indicators is constructed by taking the union of the screened indicators from different industries. The calculation process is shown in Figure 1.
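
The screening steps above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the equal-width binning used to compute entropy, the correlation cutoff `max_corr=0.8`, and the toy indicator names are hypothetical choices, not values taken from the paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def discretize(values, bins=4):
    """Equal-width binning (an assumption; the binning scheme is not stated in the text)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    return [min(int((v - lo) / width), bins - 1) for v in values]

def info_gain(values, labels, bins=4):
    """Step 1: Gain = H(Y) - H(Y | binned indicator)."""
    binned = discretize(values, bins)
    cond = 0.0
    for b in set(binned):
        subset = [y for x, y in zip(binned, labels) if x == b]
        cond += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - cond

def pearson(a, b):
    """Correlation coefficient between two indicator columns."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va and vb else 0.0

def screen(columns, labels, K=0.7, max_corr=0.8):
    """columns maps indicator name -> values; returns the retained names."""
    gains = {name: info_gain(vals, labels) for name, vals in columns.items()}
    total = sum(gains.values())
    ranked = sorted(gains, key=gains.get, reverse=True)
    kept, cumulative = [], 0.0
    for name in ranked:                      # Steps 2-3: accumulate gain shares up to K
        kept.append(name)
        cumulative += gains[name] / total
        if cumulative >= K:
            break
    final = []
    for name in kept:                        # Step 4: kept is ordered by gain, so the
        if all(abs(pearson(columns[name], columns[o])) < max_corr for o in final):
            final.append(name)               # lower-gain member of a correlated pair is dropped
    return final

# Step 5: union of the indicators retained per industry (toy data, hypothetical names).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
industry_a = {"turnover": [1, 2, 3, 4, 5, 6, 7, 8],
              "turnover_x2": [2, 4, 6, 8, 10, 12, 14, 16],
              "noise": [1, 0, 1, 0, 1, 0, 1, 0]}
industry_b = {"noise": [1, 0, 1, 0, 1, 0, 1, 0],
              "spread": [8, 7, 6, 5, 4, 3, 2, 1]}
final_set = sorted(set(screen(industry_a, labels)) | set(screen(industry_b, labels)))
```

In the toy data, the two perfectly correlated indicators carry the same information, so only one of them survives Step 4, and the uninformative indicator never enters the retained set.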

2.2. Multitask Deep Neural Network

Multitask learning, a promising branch of machine learning, has been widely used in computer vision, natural language processing, disease prediction, and other fields [28–32]. We propose a multitask deep neural network model to identify insider trading, as shown in Figure 2. The model uses hard parameter sharing: the hidden layers are shared across all tasks, while several task-specific output layers are kept to ensure the uniqueness of each task. Hard parameter sharing greatly reduces the risk of overfitting [25].

The model consists of a data input layer, two shared hidden layers, and task-specific output layers. Given that deep neural networks are slow to learn and prone to overfitting, we apply the dropout method to shared layer 1 to enhance the generalization of the model. That is, a certain percentage of hidden layer nodes are randomly discarded during forward propagation, as shown by the circles marked with dashed lines in Figure 2. The forward propagation of the network is calculated as

$$y^{(l)} = \sigma\left(W^{(l)} x^{(l)} + b^{(l)}\right) \tag{4}$$

where $y^{(l)}$ is the output of the $l$th hidden layer, $x^{(l)}$ is the input of the $l$th hidden layer, $W^{(l)}$ and $b^{(l)}$ are the weight and bias matrices, respectively, and $\sigma$ is the Sigmoid activation function. Then, the multitask loss functions are jointly optimized, as shown in formula (5):

$$L = \sum_{k=1}^{T} L_k \tag{5}$$

where $T$ is the number of tasks and $L_k$ is the error function that measures the difference between the predicted and true values of the model for the $k$th task. In this article, a cross-entropy loss function is chosen for the classification problem.
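
As a rough illustration of this architecture, the sketch below implements only the forward pass and the joint loss in standard-library Python: two shared Sigmoid layers, one output head per task, dropout on shared layer 1, and a cross-entropy loss summed over tasks. The hidden width of 8, the uniform weight initialization, the inverted-dropout rescaling, and the unweighted sum over tasks are assumptions made for the example; training (back-propagation) is omitted.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(x, W, b):
    """One fully connected layer with Sigmoid activation: sigma(W x + b)."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bi) for row, bi in zip(W, b)]

def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (hypothetical initialization)."""
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

class MultiTaskNet:
    def __init__(self, n_in, n_hidden, n_tasks, dropout=0.4, seed=0):
        self.rng = random.Random(seed)
        self.shared1 = init_layer(n_hidden, n_in, self.rng)      # shared by all tasks
        self.shared2 = init_layer(n_hidden, n_hidden, self.rng)  # shared by all tasks
        self.heads = [init_layer(1, n_hidden, self.rng) for _ in range(n_tasks)]
        self.dropout = dropout

    def forward(self, x, task, train=False):
        h1 = dense(x, *self.shared1)
        if train:  # dropout on shared layer 1 (inverted scaling is an assumption)
            keep = 1.0 - self.dropout
            h1 = [h / keep if self.rng.random() < keep else 0.0 for h in h1]
        h2 = dense(h1, *self.shared2)
        return dense(h2, *self.heads[task])[0]  # probability from the task-specific head

def cross_entropy(p, y, eps=1e-12):
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def joint_loss(net, batches):
    """batches holds one list of (x, y) pairs per task; returns the sum of per-task mean losses."""
    total = 0.0
    for task, batch in enumerate(batches):
        total += sum(cross_entropy(net.forward(x, task), y) for x, y in batch) / len(batch)
    return total

# Usage: 19 input indicators and 5 industry tasks, as in the experiments.
net = MultiTaskNet(n_in=19, n_hidden=8, n_tasks=5)
x = [0.1] * 19
probability = net.forward(x, task=2)
```

Because every task's loss passes through the same two shared layers, a gradient step on the joint loss updates the shared weights with information from all industries, while each head is updated only by its own task.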

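In the error back-propagation process, we use the Adam optimization algorithm to update the weights, which converges quickly and stably [33]. Assuming that the gradient of the loss with respect to the parameters $\theta$ at moment $t$ is $g_t$, the Adam algorithm is calculated as

$$\begin{aligned} m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t, \\ v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2, \\ \hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \\ \theta_t &= \theta_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \end{aligned} \tag{6}$$

where $\eta$ is the learning rate, $m_t$ and $v_t$ are the first moment estimate and second raw moment estimate, respectively, $\hat{m}_t$ and $\hat{v}_t$ are their bias-corrected values, $\beta_1$ and $\beta_2$ are the exponential decay rates for the two moment estimates, and $\epsilon$ is a small constant for numerical stability.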
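
The Adam update can be written as a short function. The sketch below follows the standard form of the algorithm; the defaults mirror the learning rate and decay rates reported in the experiments, while `eps=1e-8` and the quadratic toy objective are assumptions added for the example.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.9999, eps=1e-8):
    """One Adam update for a list of parameters; t is the 1-based step count."""
    new_theta, new_m, new_v = [], [], []
    for p, g, mi, vi in zip(theta, grad, m, v):
        mi = beta1 * mi + (1 - beta1) * g          # first moment estimate
        vi = beta2 * vi + (1 - beta2) * g * g      # second raw moment estimate
        m_hat = mi / (1 - beta1 ** t)              # bias correction
        v_hat = vi / (1 - beta2 ** t)
        new_theta.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v

# Toy usage: minimize f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta, m, v = [0.0], [0.0], [0.0]
for t in range(1, 201):
    grad = [2 * (theta[0] - 3.0)]
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)
```

A useful property visible here is that the very first step has magnitude close to the learning rate regardless of the gradient scale, because the bias-corrected ratio of the two moment estimates is approximately the sign of the gradient.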

Multitask learning has the advantages of data amplification, eavesdropping, offsetting some of the noise, and preventing overfitting, especially in the case of small sample sizes [25]. Multitask deep neural networks improve the identification of enterprise insider trading by building shared hidden layers and task-specific layers that take into account the correlation and difference of different tasks.

To evaluate the classification effect of the model, we select four common statistical indicators: accuracy, recall, F1-score, and AUC. Accuracy represents the proportion of observations that are predicted correctly. Recall represents the probability that a true positive sample is predicted as positive. F1-score is the harmonic mean of precision and recall. AUC is the area under the ROC curve; the larger the area under the curve, the better the predictions. Table 1 shows the confusion matrix, and the equations for each evaluation indicator are shown in formulas (7)–(9).
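
For concreteness, these indicators can be computed directly from the four confusion-matrix counts, and AUC can be computed as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. The counts and scores in the usage lines are made-up numbers for illustration only, not results from the paper.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}

def auc(scores, labels):
    """Pairwise (rank-based) AUC; ties between a positive and a negative count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical counts: 90 true positives, 10 false negatives, 20 false positives, 80 true negatives.
metrics = classification_metrics(tp=90, fn=10, fp=20, tn=80)
toy_auc = auc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])
```

With these hypothetical counts, accuracy is 170/200 and recall is 90/100, which shows how a model can have a higher recall than accuracy when false positives outnumber false negatives.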

3. Data and Indicators

The research takes the Chinese securities market as an example. The samples fall into two categories: positive samples, i.e., companies in China’s manufacturing industry that engaged in insider trading and were notified and punished by the regulator from 2001 to 2019, and negative samples, i.e., companies that had the same type of significant events and were not punished by the regulator. Positive samples are matched with negative samples according to the following principles: first, the same exchange; second, the same industry; third, the same year of significant events; fourth, the same company size.

Based on the number of manufacturing subindustries and the number of enterprises with insider trading, we select five manufacturing subindustries, including computer communication and electronic equipment manufacturing, pharmaceutical manufacturing, chemical materials and products manufacturing, electrical machinery and equipment manufacturing, and special equipment manufacturing. The data category ratio for positive and negative samples is 1 : 3.5. To address the problem of small sample data and category imbalance, we use the SMOTE method [34] to oversample the positive samples in the experiment. Finally, the data set contains 998 samples, including 499 negative samples and 499 positive samples.
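
The core idea of SMOTE is to create each synthetic minority sample by interpolating between a minority sample and one of its nearest minority neighbors. The sketch below is a simplified illustration of that idea, not the exact procedure or library used in the experiments; the function name, `k`, the seed, and the toy points are hypothetical.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples from a list of minority-class feature vectors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbors of x by Euclidean distance (excluding x itself)
        neighbors = sorted((p for p in minority if p is not x),
                           key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from x to the neighbor
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic

# Toy usage: four minority points on the unit square, six synthetic samples.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=6, k=2)
```

Because every synthetic point lies on a segment between two existing minority samples, oversampling fills in the minority region rather than duplicating samples exactly.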

Through a literature survey and theoretical research, we divide the indicators for insider trading identification into four categories. The first is stock performance, which is the most intuitive indicator for identifying insider trading. The second is corporate governance. Generally, the people involved in insider trading are mostly the controlling shareholders and executives of the company, mainly because such people tend to have the easiest access to inside information. The third is financial indices, which are relative indicators that enterprises use to summarize and evaluate their financial status and operating results. The fourth is media coverage, which can perform an external monitoring function that may reduce the incidence of irregular trading practices. We use text sentiment analysis based on a sentiment lexicon to process news data [35]. After counting the sample data, we found that 71% of the insider trading occurred within three months before the announcement of the material information. Therefore, we select the media coverage of relevant companies in the three months before the date of the significant information announcement and obtain a total of about 40,000 news reports. The indicators mentioned above come mainly from the RESSET and CSMAR databases and partly from web scraping. In total, we initially select 60 indicators. The descriptions of the indicators are shown in Table 2.

4. Experimental Results

4.1. Feature Selection Based on Information Gain and Correlation Analysis

In this article, we collect indicators from multiple sources and further filter them to improve model performance. We use information gain and correlation analysis to filter indicators for each of the five industry data sets according to the process in Figure 1 and then take the union of the results to construct the indicator set. The indicators are selected with threshold values of 60%, 70%, and 80%, respectively, and a comparative analysis is performed to select the most appropriate indicator set.

Taking 70% as an example, Figure 3 shows the information gain ranking for chemical raw materials and products manufacturing; the 11 indicators that make up the top 70% of cumulative information gain are selected. Correlation analysis identifies two pairs of strongly correlated indicators, as shown in Table 3, and we eliminate the indicator with the smaller information gain value in each pair, leaving 9 indicators. Similar operations are performed for the other four industries, and finally, the selected indicators from the five industry data sets are combined to obtain a total indicator set containing 19 indicators, as shown in Table 4.

4.2. Comparative Analysis of Different Indicator Sets

Based on the different thresholds, we construct three sets of indicators and select the best one by comparative analysis. The data are randomly divided into training and testing sets in a ratio of 70% : 30%, and we report the average of ten rounds of test results. We mainly use the DNN and MTL-DNN methods for this comparison, and the results are shown in Table 5. The dropout rate is 0.4. For the Adam optimization algorithm, the learning rate is $\eta = 0.001$, and the exponential decay rates are $\beta_1 = 0.9$ and $\beta_2 = 0.9999$.

In Table 5, Pool-DNN denotes the results of a DNN trained on the pooled data set of all industries; STL-DNN denotes the mean of the results of DNNs trained separately on each of the five single-industry data sets. Evaluation indicators such as accuracy, recall, and AUC are best when the threshold is 70%, so we select the set of indicators corresponding to the 70% threshold and use it when comparing the models below. The results show that screening indicators with information gain and correlation analysis further improves model performance. MTL-DNN outperforms DNN under the different indicator sets, indicating that the model constructed in this article has strong and robust insider trading identification capability.

4.3. Comparative Analysis of Different Models

To validate the model’s effectiveness, this article compares the proposed model with logistic regression, SVM, DNN, RF, and XGBoost; Table 6 reports the results. Among these methods, logistic regression has the lowest values of indicators such as F1-score and recall. The model proposed in this article performs best on the evaluation indicators, including accuracy, recall, F1-score, and AUC.

Given that the main objective of this research is to identify companies with insider trading, we focus on the recall rate, that is, the percentage of companies with insider trading that are correctly identified. Table 6 shows that the recall rate of the MTL-DNN model is 90.6%, an improvement of 7.6% and 6.5% over STL-DNN and Pool-DNN, respectively. This indicates that the model proposed in this article is more accurate in insider trading identification.

4.4. Comparative Analysis of Models with Different Number of Industries

In this article, we select five subindustries of the manufacturing industry for the experiment. To test the effectiveness of the model after more industries are added as subtasks, this section compares the evaluation indexes of the MTL-DNN model with different numbers of tasks. For example, when the number of tasks is 2, we select two of the five industry data sets as tasks; there are ten such combinations in total, and we take the average of their results. The same procedure is applied for the other numbers of tasks.
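
The averaging over task combinations can be sketched with `itertools.combinations`. In the sketch below, `evaluate` is a hypothetical callback standing in for "train MTL-DNN on these industries and return one evaluation index"; the placeholder used in the usage lines returns a dummy value and is not a real experiment.

```python
from itertools import combinations
from statistics import mean

INDUSTRIES = [
    "computer communication and electronic equipment",
    "pharmaceutical",
    "chemical materials and products",
    "electrical machinery and equipment",
    "special equipment",
]

def average_over_task_subsets(n_tasks, evaluate):
    """Average an evaluation index over every choice of n_tasks industries."""
    subsets = list(combinations(INDUSTRIES, n_tasks))
    return mean(evaluate(subset) for subset in subsets), len(subsets)

# Placeholder evaluation (hypothetical): returns a dummy score instead of training a model.
dummy_score, n_combinations = average_over_task_subsets(2, lambda subset: len(subset) / 5)
```

With five industries, choosing two gives ten subsets, choosing three also gives ten, choosing four gives five, and choosing all five gives a single subset.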

As is shown in Figure 4, with the increase of the number of tasks, the evaluation indexes (i.e., F1-score, recall rate) of the MTL-DNN model show an increasing trend. It demonstrates that by adding more subindustries, more information can be shared between different tasks, which may yield better results. Therefore, the model proposed in this article can add more industries and has good scalability.

5. Conclusions

This article utilizes data from multiple sources to improve the research validity. For high-dimensional indicators, this article uses information gain and correlation analysis to filter indicators. Taking the identification of insider trading in different industries as different subtasks, this article constructs a multitask deep neural network model, which considers the correlation and heterogeneity among insider trading activities in different industries. Compared with traditional machine learning methods, such as Logistic, SVM, RF, and XGBoost, the model proposed in this article performs better with higher values of evaluation metrics, such as accuracy, recall, and AUC. This research is helpful to market regulators to improve their supervision accuracy and efficiency on insider trading identification. Moreover, it is of great significance to provide timely supervision by intelligent means.

Several directions remain for future work. First, the data set constructed in this article can be enriched with further application scenarios to serve other fields, such as enterprise credit system construction. Second, although the multitask model constructed in this article is applied to only some industries in the experiment, it can be easily extended to more industries. Finally, the model proposed in this article can be applied to other identification problems concerning companies in further research.

Data Availability

The data used to support the findings of the study can be obtained from the author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Social Science Foundation of China, grant no. 19BTJ023.