Abstract

Stroke is an acute cerebrovascular disease caused by the rapid rupture or blockage of intracranial blood vessels for a variety of reasons, preventing blood from flowing into the brain and causing damage to brain tissue. The global burden of stroke disease is quickly increasing, and ischemic stroke (IS) accounts for 60 percent to 70 percent of all strokes, owing to the prevalence of people’s bad lifestyles and the intensity of global ageing. Although most IS patients have received effective treatment, many patients still have certain dysfunction or death after treatment, and the recurrence rate is about 18%, which brings a heavy economic burden to society and families. Therefore, it is urgent to build a postoperative prediction model for IS, so as to take targeted clinical intervention measures, which has extremely important practical significance for improving the prognosis of IS. The following work has been done in this paper: (1) the theoretical background for the BP prediction model and logistic regression prediction model suggested in this work is offered, as well as the research progress and related technologies of IS recurrence prediction by domestic and foreign academics. (2) The basic principles of BPNN and logistic regression are introduced, and the logistic multifactor predictor is constructed. (3) The experimental results show that the consistency rate, sensitivity, and specificity of the prediction results of BPNN are higher than those of logistic regression, indicating that for diseases such as IS, which have many pathogenic factors and complex relationships between factors, the fitting effect of BPNN model is better than that of the logistic regression model.

1. Introduction

Stroke follows cardiovascular disease as the major cause of mortality globally. There are about 2.4 million new stroke patients in my country every year, about 1.1 million deaths, and about 11 million poststroke survivors, of which about 70% are IS. IS refers to the ischemic necrosis or softening of the brain tissue caused by ischemia and hypoxia caused by various reasons. About 75% of IS patients have different degrees of dysfunction after the onset of the disease, which brings a heavy burden to the family and society, and IS is prone to relapse within 1 year after the onset of IS. The recurrence rates were 10.9%, 13.4%, and 14.7%, respectively [1]. In the first year after the onset of acute ischemic stroke, 1 in 18 people has a recurrent stroke. The harm caused by the recurrence of IS is far greater than that of the first stroke. The neurological damage caused by the recurrent stroke is more serious and more refractory and has a higher mortality rate than the first stroke. It is one of the main causes of death, rehospitalization, and long-term disability. Cumulative mortality after relapse more than doubled compared with first-episode stroke, and the risk of death increased approximately 17-fold. Reducing the recurrence rate of IS is the key to improving the prognosis of stroke, and secondary prevention of stroke can reduce the risk of IS recurrence by about 13% to 67%. Risk prediction of IS recurrence is the key to effective clinical secondary prevention and the most effective means to reduce the fatality and disability rate of patients [2, 3]. There are two main issues to consider in IS recurrence risk prediction: the first is the selection of IS recurrence risk predictors. The screening of predictive factors is an important step in the construction of predictive models, and the effect of predictive models depends on the accuracy and sensitivity of the predictive factors. There are many studies on the risk factors of IS recurrence, but the research results are affected by regions, medical policies, social economy, etc., and the research results are different.

It is difficult to predict the screening of IS recurrence risk factors. The risk prediction of IS recurrence is mainly based on clinical factors. The results of risk verification showed that the AUC values of the area under the ROC curve were all 0.59. The model did not cover all risk factors and had a limited predictive impact, making it difficult to meet clinical needs [4]. Scholars have increasingly employed biomarkers, imaging markers, and TCM syndrome distinguishing signals to the risk prediction of IS recurrence, improving the prediction effect in recent years. The method for building the IS recurrence risk prediction model is the second. Traditional statistical methods for developing prediction models, such as Cox proportional hazards regression analysis and logistic regression analysis, are currently the most popular. Traditional statistical methods, on the other hand, have stringent data-type restrictions and have little impact on data mining when dealing with clinical recurrence data. Machine learning has started to be used to illness risk prediction with the progress of computer technology and smart medical care. Multidisciplinary in nature, machine learning draws on a variety of fields, including probability and statistics, approximation theory, convex analysis, and theory of algorithm complexity. The artificial intelligence of a computer may be realized via the machine’s ability to learn from the data’s inherent regularity information and gain new experience and knowledge [5, 6]. Machine learning algorithms have the characteristics of high efficiency and accuracy in processing big data, and the prediction effect is better than traditional statistical methods [7]. The disease risk prediction model built with BPNN, SVM algorithm, XGBoost algorithm, and other machine learning algorithms outperforms established statistical methods such as the logistic regression model and the Cox proportional hazards regression model [8]. As a result, this work used prospectively gathered medical-related data from IS patients, applied multivariate analysis to screen out risk factors affecting IS prognosis, and built a machine learning algorithm-based risk prediction for ischemic stroke with a bad prognosis. The model can quickly predict and identify the risk of poor functional recovery of IS patients in the early stage of disease diagnosis and treatment, scientifically assist the selection of clinical treatment plans, improve the prognosis of stroke patients, reduce the disability rate of stroke, and improve the quality of medical care, and it provides a new idea for predicting the prognosis of IS.

The following is the paper’s organisation paragraph: in Section 2, the related work is provided. The suggested work’s approaches are examined in Section 3. The experiments and results are discussed in Section 4. Finally, the research is completed in Section 5.

A large number of studies have shown that the prognosis of ischemic stroke is affected by factors such as age, family history, smoking, drinking, hypertension, abnormal lipid metabolism, abnormal glucose metabolism, and hyperhomocysteinemia. This provides a good basis for the risk prediction study of poor prognosis of ischemic stroke [9, 10]. At present, traditional evaluation tools, such as Essen scale, ABCD2 scale, and SPI-II scale, are mainly used in clinical practice for ischemic stroke to predict the recurrence risk of patients. ESRS is based on the stroke prediction model developed by the CAPRIE Institute. The higher the score, the greater the risk of stroke recurrence. The Essen score is currently the most commonly used scale in the world to predict the long-term recurrence risk of stroke [11]. Reference [12] developed the ABCD2 scoring method based on the ABCD scoring system proposed by the “Oxfordshire Community Stroke Project” study to assess the risk of stroke in patients with transient ischemic attack. Reference [13] proposed the SPI-III scale in 2000 to assess the long-term recurrence risk of stroke patients. Age over 70, diabetes, coronary heart disease, and a history of stroke are all listed as risk factors for stroke recurrence in the evaluation tool. Using manual or traditional statistical analysis methods, the above-mentioned scoring scales combine several risk factors that affect stroke prognosis and assign varying weights to each risk factor. Diverse groups of people have different lifestyles; thus, it is still unclear whether the weights allocated to these aspects are appropriate for Chinese people. At present, for the prediction of IS prognosis, domestic and foreign scholars still mostly use retrospective cohort studies to construct a risk prediction model for poor stroke prognosis based on traditional statistical analysis of logistic regression and Cox regression [14, 15]. The medical and health sector has entered the age of big data as a result of the fast growth of medical and health informatization building. Machine learning algorithms have been extensively employed in the medical industry, such as illness prediction, disease prognosis evaluation, disease auxiliary diagnosis, and health management, in the face of these large data with huge volume, varied kinds, and high hidden value [16]. This is because machine learning algorithms can take into account more variables and effectively reflect the unpredictable and complex nature of the human body than traditional prediction models that only consider one or two variables. A multilayer feedforward neural network known as the BPNN was initially suggested by researchers in 1986, and it has since become the most widely used neural system model in neuroscience [17]. BPNN does not have high requirements for data types and has strong nonlinear mapping ability and adaptive ability, which can effectively dig out the influencing factors related to the occurrence of diseases from the complicated medical big data and scientifically evaluate the occurrence of disease risk [18]. Reference [19] used BPNN and logistic regression model to construct a risk prediction model for spontaneous hemorrhagic transformation in ischemic stroke patients, respectively. The results showed that the predictive performance of BPNN was better than that of the logistic regression model. Random forest is a supervised learning ensemble algorithm proposed by reference [19] in 1995. Because of its great antinoise capabilities and difficult overfitting, the algorithm is commonly employed in illness risk prediction. The random forest model outperforms the BPNN model and the logistic regression model in predicting the risk of coronary stenosis, according to academic research, and its AUC values are 0.752, 0.723, and 0.739, respectively. Based on the current research status at home and abroad, various traditional risk scale scores have achieved certain results in the prognosis prediction of ischemic stroke, but various scales only include simple indicators such as age and past medical history, and the prediction accuracy is often low and cannot fully explain whether there is an interaction between the indicators. In addition, the existing studies also have problems such as small sample size, single center, retrospective studies, and low data quality and sample representativeness [20]. Because of the fast growth of data mining technology and the increasing amount of data in electronic medical records, the use of machine learning in the area of illness prognostic prediction has made significant strides [2124]. There are still few studies on prognosis prediction of hemorrhagic stroke patients, and further research is needed.

3. Method

3.1. The Basic Principle of Artificial Neural Network

Analogous neural networks (ANNs) are mathematical models of the brain’s synaptic connections. Modern neuroscience research has resulted in the reduction, abstraction, and modeling of the structure and function of the biological nerve system of the human brain, which reflects the human brain’s essential properties. This nonlinear network system is made up of a large number of small basic units called neurons that are connected in a way that resembles how the human brain processes information and distributes processing. ANN does not use mathematical methods for precise calculation, nor does it need to predetermine basic functions like regression equations, but directly train the neural network with data or signal samples, and obtain a response after a limited number of iterative calculations. Another advantage of the neural network method is that it requires no prior information and offers self-learning, self-adaptation, and fault tolerance, making it ideal for multivariate pattern identification. It provides a new way to solve the uncertainty, ambiguity, and dynamic complexity in hygiene evaluation. At present, it has been widely used in the prediction, diagnosis, image processing, and other aspects of diseases in the medical field, especially the error backpropagation algorithm which is more widely used.

3.1.1. Backpropagation BP Network Algorithm

The error backpropagation technique was first published by Paul Werbos in 1974, although it was not extensively used at the time. BPNN is a multilayer forward neural network based on this approach. It was not until the mid-1980s that some scholars conducted further research on the BP algorithm and wrote the BP algorithm into the book “Parallel Distributed Processing,” and the BP algorithm was not widely known. BPNN is a typical multilayer feedforward neural network using a teacher-supervised learning algorithm. It has powerful computing power and can master the input-output mapping relationship implied by the learning sample through training and has a wide range of applications in classification and prediction. Parallel networks are used in the BPNN. Hidden node output signals are passed to the output node, and eventually, the output result is supplied. Forward propagation and backward propagation of errors are both used in the algorithm’s learning phase. A forward propagation method is used to process input information from the input layer to the hidden layer and transmit it to the output layer. A neuron’s current state solely influences the following neuron’s current state. In the event that the desired output result cannot be attained in the output layer, the error signal is returned over the original connection channel and is then processed via backpropagation. The mean square of the error is reduced by adjusting the weights of neurons in each layer. BPNN’s excellent nonlinear mapping ability and generalization function have been shown by neural network theory. A three-layer network may realize any continuous function or mapping.

3.1.2. BP Network Structure

BPNN is by far the most famous and widely used neural network. Theoretically, the complex nonlinear relationship can be fully described by a three-layer feedforward network. An input layer, a hidden layer, and an output layer are the three layers that make up a network. The topology of a typical BPNN is shown in Figure 1.

Because of this, the BP algorithm alters its weights and biases in the opposite direction of gradient, which is similar to how a linear network’s learning algorithm works. A mathematical formula for the BP algorithm’s iterative computation is as follows: where represents the current weight and bias, represents the next weight and bias generated by iteration, is the gradient of the current error function, and represents the learning rate.

Using an example of a BPNN with two hidden layers and four total layers of neurons, we can figure out how the learning process works in this section. is the number of inputs, and is the identifier for any one of them. An represents one of the neurons in the first hidden layer, which has neurons in total. There are a total of neurons in the second hidden layer, each of which has been labeled with. There are neurons in the output layer, each of which is symbolized by the letter . is the weight between the th neuron in the input layer and the first hidden layer, which is indicated as the weight output from the th neuron to the first hidden layer. Denotes the weight between the first and second hidden layers. Input and output weights are indicated by and, respectively.

The neuron input is denoted as , the output is denoted as , and represents the input of the neuron in the first hidden layer. Let the transfer function of all neurons be the sigmoid function. The training sample set is , and any training sample is an -dimensional vector, that is, ; the expected response is , and the actual output is . Let be the number of iterations, and both the weights and the actual output are functions of . When the network input training sample is , the network signal is transmitted in a forward manner. For the intermediate value of each layer, the expression can be written as follows.

The input of the neuron in the first hidden layer is

The output of the neuron in the first hidden layer is

The output of the neuron in the second hidden layer is

The input of the neuron in the output layer is

The output of the neuron in the output layer, that is, the network output, is

The output error of the neuron in the output layer is

Define the error energy as , and the sum of the error energy of all neurons in the output layer is

The error is opposite to the signal and propagates from back to front, and in the process of backpropagation, the weights and biases are modified layer by layer. The adjustment process of backpropagation and error is calculated below.

3.1.3. Weight Adjustment

A partial differential of output error energy compared to the predicted response to weight is used in the BP algorithm, and the sign of this difference is used to alter weight. Following is the calculation of this partial differential’s value. According to the learning rule of gradient descent, the correction amount of is where is the learning step size, is the local gradient, and can be obtained according to the forward propagation process of the signal. Thus, the next iteration value of is calculated.

Get the next iteration value of the weight before the hidden layer and the output layer :

In the same way, the next iteration value of the weight before the hidden layer and the hidden layer can be obtained:

Through the following iterative formula, the weight between the input layer and the hidden layer of the next iteration can be obtained.

The above is the weight iteration formula of the learning rule of the full sigmoid transfer function BPNN including two hidden layers.

3.2. Basic Principles of Logistic Regression
3.2.1. Logistic Regression Model

One way to look at the connection between binary data and some of their influencing elements is through a probabilistic nonlinear regression approach known as logistic regression (LR). It is often used in epidemiological research to examine the quantitative association between illnesses and other risk variables. Let the dependent variable be a binary variable whose value is

There are also independent variables that affect the value of . Note that represents the probability of a positive result under the action of independent variables, and the logistic regression model can be expressed as where is the constant term and is the regression coefficient. If is used to represent the linear combination of independent variables,

Transforming formula (14), the logistic regression model can be expressed as the following linear form:

The left end of formula (16) is the natural logarithm of the ratio of the probability of positive and negative results, which is called the logit transformation of , and is recorded as . It can be seen that although the value range of probability is between 0 and 1, has no numerical limit.

3.2.2. Logistic Regression Modeling

A logistic regression model was developed using 500 patients acquired from a retrospective investigation as training samples. To examine the prediction performance, 200 patients from a prospective inquiry were employed as prediction samples, which were substituted into the developed model. Using the selected training samples, univariate logistic regression analysis was performed on the patients’ basic information, clinical information, clinical biochemical indicators, postdischarge rehabilitation, and living conditions. For factor logistic regression model, see Table 1. Multifactor screening was carried out, and the forward method based on partial maximum likelihood estimation was used. With as the inclusion criterion and as the exclusion criterion, a logistic regression model for predicting the recurrence of ischemic stroke patients was established. There are 7 factors that finally entered the model (see Table 2).

4. Experiment and Analysis

4.1. Multivariate Logistic Regression Results

From the 7 influencing factors screened above, a logistic regression model is established, and its expression is ; in the formula,,,,,,, andrepresent 7 factors of age, diastolic blood pressure, language barrier, alcohol consumption, triglyceride, aspirin, and sleep, respectively. Substitute 200 test samples into the logistic model established above, draw the ROC curve as shown in Figure 2, and calculate the area under the curve (AUC): AUC is 0.735, the prediction accuracy rate is 82.5%, and the sensitivity and specificity are 63.2% and 75.6%, respectively. Youden index is 38.2% (see Table 3).

4.2. BP Neural Network Modeling Results
4.2.1. Establishment and Training of the Network Model

500 retrospectively investigated cases were used as the training set, and 200 prospectively investigated patients were used as the testing set. In order to simplify the calculation and prevent unnecessary overfitting, logistic regression was used to screen all factors by single factor in this study, and all 16 factors screened out by a single factor were used as input variables, that is, the input layer neurons . Modeling is done using three distinct types of simple BPNN models with varied amounts of hidden layers. The number of hidden layer nodes is calculated using the trial and error method in this study, with the first hidden layer node being defined as 8 and the second and third layers being reduced layer by layer to 6 and 3, respectively. At the same time, the maximum training error is to be selected as 0.001, the initial learning rate is 0.15, the minimum learning rate is 0.001, the maximum learning rate is 0.2, and the kinetic energy term .

4.2.2. BP Neural Network Modeling Results

There is no statistically significant difference between the prediction results of the training set and the actual results of each model (), and the kappa values are all greater than 0.7, indicating that the prediction results are more consistent with the actual results. See Figure 3, so it can also be considered that the number of different hidden layers has little effect on the prediction results of the test set samples.

4.2.3. Comparison of the Area under the ROC Curve of the Three BPNN Models

The predicted probability and actual results of the three BPNN models are used to make the ROC curve. The experimental results are shown in Figure 4. It can be seen that the prediction accuracy of BPNN-1 is higher than that of the other two models.

4.2.4. Comparison of Prediction Accuracy and Validity of Models with Different Numbers of Hidden Layers

The prediction accuracy rates of each model are 95.2%, 94.5%, and 93.7%, respectively, and there is no statistical significance between the three accuracy rates. It can be seen that there is no difference in the prediction accuracy of the BPNN with different hidden layers. The results are shown in Figure 5.

4.2.5. Analysis of Influencing Factors of BPNN

Increasing the number of hidden layers cannot improve the prediction effect of BPNN and may even affect the accuracy of model prediction. At the same time, the modeling time of a single hidden layer is short, and overfitting is not easy to occur. Choose a BPNN with one hidden layer. According to the influence degree of the imported influencing factors on the network, the top three influencing factors with the highest degree of influence are ADL, diastolic blood pressure, and aspirin consumption.

4.2.6. Predictive Validity and Area under the ROC Curve of the Test Set of the BPNN Model

Substitute the data of the test set into the trained neural network model, and draw the ROC curve. The area under the ROC curve was calculated to be 0.796, and the calculated agreement of the model prediction was 86.8%, the sensitivity was 82.1%, the specificity was 80.5%, and the Youden index was 64.7%. The specific results are shown in Table 4.

4.3. Comparison Results between BP Neural Network and Logistic Regression

Compared with the logistic regression prediction model, the product under the ROC curve of the BPNN prediction result was 0.796, which was greater than the 0.735 obtained by the logistic regression prediction model, and the difference was statistically significant when comparing the area under the curve of the two models. The accuracy rate, sensitivity, specificity, and Youden index of the BPNN are also higher than those of the logistic regression model, so the prediction effect of the BPNN is better than that of the logistic regression, as shown in Figure 6.

5. Conclusion

The continuous development of my country’s social economy has improved the living standards of residents. Because of the prevalence of unhealthy lifestyles in my country, the incidence of stroke has continued to climb, and the age of onset has gradually decreased. Stroke has had a significant impact on our people’s health. In my country, it has become one of the most serious public health issues. The latest data from the Global Burden of Disease Study show that from 2005 to 2017, the incidence of ischemic stroke in my country was on the rise. In 2017, there were 156 new ischemic strokes per 100,000 people in my country. Research shows that the disability-adjusted life years lost due to stroke in my country ranks first among all diseases, with a recurrence rate of 9.7% within three months of onset and a disability rate of 37.1%. Because BPNN has been paid more and more attention by medical workers in disease prediction, people often use it to compare with the logistic regression model. The advantages of the logistic regression model are that it is simple and easy to use, the quantitative interpretation of the individual effects of factors is clear, the approximate estimation of the relative risk can be directly obtained, and the methodology of the quantitative dependence of variables can be established. The neural network model uses the information theory method, along with a human-like thinking mode, to develop the network by learning existing examples. It has a strong ability to solve the collinear effect and interaction between variables, and it has no restrictions on the distribution of data and can make full use of data information, with strong fault tolerance. As a nonlinear mathematical model, neural networks aid in the discovery of undiscovered correlations among various variables. Therefore, the following work is done in this paper: (1) the research progress and related technologies of IS recurrence prediction by domestic and foreign scholars are introduced, and the theoretical basis for the BP prediction model and logistic regression prediction model proposed in this paper is provided. (2) The basic principles of BPNN and logistic regression are introduced, and the logistic multifactor predictor is constructed. (3) The experimental results are that the area under the ROC curve of the logistic regression prediction model is 0.735, and the area under the ROC curve of the BPNN prediction results is 0.796. The Youden indices of BPNN and logistic regression are 64.7% and 38.2%, respectively, indicating that the prediction effect of BPNN is better than that of logistic regression. The consistency rate, sensitivity, and specificity of BPNN prediction results are greater than those of logistic regression, showing that the BPNN model has a superior fitting effect for disorders like ischemic stroke, which have many pathogenic components and complex connections between them.

Data Availability

The datasets used during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare that they have no conflict of interest.