Colorectal cancer (CRC) is the third leading cause of cancer-related death and the fourth most frequently diagnosed cancer worldwide. This study aimed to identify novel and effective diagnostic markers to expand the methods available for CRC diagnosis. Exosomal miRNA expression data from CRC and normal blood samples were subjected to the XGBoost algorithm, and five miRNAs related to CRC diagnosis were preliminarily identified. Multilayer perceptron (MLP) classifiers were then constructed on different feature subsets. Using integrated feature selection (IFS), we found that the MLP classifier built on the first four miRNAs (miR-654-5p, miR-126, miR-10b, and miR-144) had the highest Matthews correlation coefficient (MCC). Principal component analysis (PCA) was then performed on the samples using the expression data of these four miRNAs. The analysis indicated that the signature based on these four feature miRNAs could effectively distinguish CRC samples from normal samples. Furthermore, we extracted exosomes from clinical blood samples and performed qRT-PCR, which revealed that the expression trends of the four feature miRNAs were consistent with those in the test set. Collectively, these four feature miRNAs might serve as serum tumor biomarkers, and our study offers a new approach to early-stage CRC diagnosis.

1. Introduction

As the third leading cause of cancer-related death and the fourth most frequently diagnosed cancer [1], colorectal cancer (CRC) shows rising morbidity and mortality, making it a public health burden [2]. According to population and disease statistics, nearly 2.2 million new cases are projected by 2030 [3]. CRC is a genotypically and phenotypically heterogeneous disease with diverse molecular characteristics [4]. Accurate early diagnosis enables CRC patients to receive timely and precise treatment, thereby reducing CRC mortality. Although colonoscopy is the gold standard for CRC screening, participation in population screening programs remains poor because the test is invasive and requires adequate bowel preparation [5-8]. In addition, some studies have suggested that carcinoembryonic antigen and calprotectin can be used as diagnostic markers for CRC, but their specificity and sensitivity are low, and they cannot yet be applied effectively to the early clinical diagnosis of CRC [9, 10]. Hence, effective biomarkers are needed to improve the early diagnosis rate of CRC and to inform CRC treatment.

Recently, exosome biomarkers containing multiple RNAs and proteins have become a focus of research in cancer diagnosis and treatment [11]. Exosomes are small cup-shaped vesicles, 30-140 nm in diameter, that are secreted by cells including immune cells, neural cells, stem cells, and tumor cells [12-14]. A growing body of research shows that exosomes are related to tumorigenesis. Tumor-derived exosomes are involved in the exchange of genetic information between tumor cells and basal cells, thereby regulating angiogenesis and promoting tumor growth and invasion [15]. Useful exosomal biomarkers have already been identified for CRC diagnosis. In blood exosomes, miR-125a-3p and miR-638 have been shown to aid the early diagnosis of CRC in clinical practice [16, 17]. These findings demonstrate the importance of exosomal miRNAs in screening for early-stage CRC. Therefore, we sought to identify further exosomal miRNAs, as well as their regulatory networks, that may be effective for CRC diagnosis, which would benefit a comprehensive understanding of the molecular mechanisms underlying CRC development.

The rapid development of biotechnology in the age of big data has stimulated the application of bioinformatics in medical research; bioinformatics technology based on high-throughput sequencing data is an effective and promising tool for identifying biomarkers for cancer diagnosis [18, 19]. Machine learning is an artificial intelligence technique that has been gradually applied to medical research in recent years. Lian et al. [20] used one-class logistic regression, a machine learning method, to train gene expression-based and methylation-based stemness indices for medulloblastoma and further identified the corresponding potential drugs, providing new ideas for improving the survival of medulloblastoma patients and for targeting stem cells. Koppad et al. [21] screened candidate diagnostic genes for CRC using six machine learning classification methods: Adaboost, ExtraTrees, logistic regression, Naive Bayes classifier, random forest, and XGBoost. Thus, there is potential for wider application of novel bioinformatics methods to identify novel diagnostic biomarkers from public databases.

In this study, by analyzing the miRNA expression data of CRC patients and healthy individuals in the Gene Expression Omnibus (GEO) database, we preliminarily screened miRNAs with potential diagnostic value using XGBoost and established multilayer perceptron (MLP) classifiers, determining the optimal miRNA combination by integrated feature selection (IFS). Thereafter, the clinical value of the diagnostic markers was assessed by testing their levels in blood exosomes from clinical CRC patients. In conclusion, our study provides potential biomarkers that may be effective for the clinical diagnosis of CRC.

2. Materials and Methods

2.1. Data Source and Preprocessing

Exosomal miRNA data from CRC patients and healthy individuals were downloaded from the Gene Expression Omnibus (GEO) (https://www.ncbi.nlm.nih.gov/geo/) as dataset GSE39833 (tumor: 88 and normal: 11), which was generated on the Agilent-021827 Human miRNA Microarray G4470C platform (GPL14767). Differential analysis was performed with the R package “limma” [22] on the standardized miRNA expression data (, ).

2.2. XGBoost Feature Selection

XGBoost is a scalable tree-boosting machine learning system that builds a single strong learner by combining multiple weak learners. It approximates the loss function with a second-order Taylor expansion and reduces the likelihood of overfitting through regularization [23]. As a gradient-boosted decision tree approach, XGBoost has an objective function defined as

$$\mathrm{Obj} = \sum_{i=1}^{n} \mathrm{Loss}(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

where Loss is the training loss between the observed label $y_i$ and the predicted label $\hat{y}_i$ of sample $i$, $\Omega$ represents the complexity of the trees $f_k$, and $K$ is the number of trees. The model is optimized by minimizing the objective function. We therefore adopted additive training to calculate the training loss and used the Taylor expansion to rapidly optimize the prediction at the $t$-th round of additive training. The optimal complexity of the tree was determined via the greedy algorithm.
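The second-order step mentioned above can be spelled out as follows. This is a sketch in the standard notation of the XGBoost literature, not notation taken from this study: the symbols $f_t$ (the tree added at round $t$), $g_i$ and $h_i$ (first and second derivatives of the loss), $T$ (number of leaves), $w_j$ (leaf weights), and the penalties $\gamma$ and $\lambda$ are all assumptions introduced here for illustration.

```latex
\begin{aligned}
\mathrm{Obj}^{(t)} &= \sum_{i=1}^{n} \mathrm{Loss}\bigl(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\bigr) + \Omega(f_t) + \text{const} \\
&\approx \sum_{i=1}^{n} \Bigl[ \mathrm{Loss}\bigl(y_i, \hat{y}_i^{(t-1)}\bigr) + g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t(x_i)^2 \Bigr] + \Omega(f_t) + \text{const}, \\
g_i &= \frac{\partial\, \mathrm{Loss}\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)}{\partial\, \hat{y}_i^{(t-1)}}, \qquad
h_i = \frac{\partial^2\, \mathrm{Loss}\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)}{\partial\, \bigl(\hat{y}_i^{(t-1)}\bigr)^2}, \\
\Omega(f_t) &= \gamma T + \tfrac{1}{2}\, \lambda \sum_{j=1}^{T} w_j^2 .
\end{aligned}
```

Because the first term inside the brackets does not depend on $f_t$, only the $g_i$ and $h_i$ terms and the regularizer matter when choosing the new tree, which is what makes each round of additive training fast to optimize.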

To find miRNAs that could distinguish CRC from normal samples in GSE39833, we used XGBoost to rank the importance of feature miRNAs, and five characteristic miRNAs associated with CRC diagnosis were retained for subsequent analysis. To reduce the effect of class imbalance, the training set was then resampled with the SMOTE method using the Python package “imblearn”, together with Bayesian optimization.
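To illustrate the idea behind SMOTE resampling, the sketch below generates synthetic minority-class points by interpolating between a sample and one of its nearest minority-class neighbours. It is a simplified pure-Python illustration, not the “imblearn” implementation used in the study; the function name `smote`, its parameters, and the expression values are hypothetical. In GSE39833 the normal group (11 samples) is the minority class.

```python
import math
import random


def smote(minority, n_synthetic, k=3, seed=0):
    """Create n_synthetic points by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base sample (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical expression values for a handful of minority-class (normal) samples.
normal_samples = [[1.0, 2.0], [1.2, 1.9], [0.8, 2.3], [1.1, 2.1]]
new_samples = smote(normal_samples, n_synthetic=6)
print(len(new_samples))  # 6
```

Each synthetic point lies on a line segment between two real minority samples, so the resampled class stays within the region already occupied by the observed data.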

2.3. Construction of the MLP Classifier

To build a more precise diagnostic classifier, we constructed MLP classifiers on different subsets of the five characteristic miRNAs retained after XGBoost feature selection, using the Python package “sklearn” [24]. Each MLP classifier had two hidden layers, and all possible combinations of node numbers in the first layer (1 to 5) and the second layer (1 to 5) were scanned with sklearn.neural_network. Other parameters included (1) , (2) , (3) , and (4) .
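The size of the search described above can be made concrete with a short enumeration. This sketch assumes that the feature subsets are nested (top-1 through top-5 in ranked order) and that the four named miRNAs are listed in rank order; the fifth name is a placeholder because the text does not name it. The dictionary layout is hypothetical, and no classifier is trained here.

```python
from itertools import product

# Ranked features: the first four names come from the text; the fifth is a
# placeholder because the source does not name it.
ranked_mirnas = ["miR-654-5p", "miR-126", "miR-10b", "miR-144", "fifth_miRNA"]

# Nested subsets (an assumption): top-1, top-2, ..., top-5 features.
feature_subsets = [ranked_mirnas[:n] for n in range(1, len(ranked_mirnas) + 1)]

# Two hidden layers with 1 to 5 nodes each give 25 architectures.
architectures = list(product(range(1, 6), repeat=2))

# One candidate classifier per (subset, architecture) pair.
candidates = [
    {"features": subset, "hidden_layer_sizes": arch}
    for subset in feature_subsets
    for arch in architectures
]
print(len(feature_subsets), len(architectures), len(candidates))  # 5 25 125
```

Each `hidden_layer_sizes` tuple has the form accepted by the parameter of the same name in `sklearn.neural_network.MLPClassifier`, so a candidate could be trained and scored in a cross-validation loop.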

2.4. Screening of Optimal Feature Genes

The MCC of each of the above classifiers was obtained using IFS. The MCC is the correlation coefficient between observed and predicted binary classifications and ranges from -1 to +1: +1 indicates a perfect prediction, while -1 indicates total disagreement between observation and prediction. The MCC is regarded as the most informative single score for assessing the prediction quality of a binary classifier from its confusion matrix [25]. IFS curves were plotted with the MLP classifiers based on different subsets on the abscissa and the MCC of each subset on the ordinate. The classifier with the highest MCC was selected as the optimal classifier for CRC diagnosis.
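For reference, the standard definition of the MCC in terms of the confusion matrix is given below. The symbols $TP$, $TN$, $FP$, and $FN$ (true positives, true negatives, false positives, and false negatives) are introduced here and do not appear in the original text.

```latex
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}
{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
```

A value of 0 corresponds to a prediction no better than chance. Because all four cells of the confusion matrix enter the formula, the score remains meaningful when the classes are unbalanced, as they are in GSE39833.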

2.5. Principal Component Analysis (PCA)

PCA is one of the most widely adopted dimensionality reduction algorithms. Its main idea is to map n-dimensional data onto k dimensions defined by new orthogonal features, the principal components [26]. We performed PCA with the R package “FactoMineR” [27] on the expression data of the characteristic miRNAs in the optimal MLP classifier to explore how well this classifier discriminates between samples (https://www.rdocumentation.org/packages/FactoMineR/versions/2.4).
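The mapping described above can be written compactly as follows. This is a textbook sketch, not the internal procedure of “FactoMineR”; the symbols are assumptions introduced here: $X$ is the centered data matrix with $m$ samples and $n$ features, $C$ its covariance matrix, and $\lambda_j$ and $v_j$ the eigenvalues and eigenvectors of $C$.

```latex
\begin{aligned}
C &= \frac{1}{m-1}\, X^{\top} X, \\
C\, v_j &= \lambda_j\, v_j, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0, \\
Z &= X V_k, \qquad V_k = [\, v_1, \dots, v_k \,], \\
\text{explained variance of component } j &= \frac{\lambda_j}{\sum_{l=1}^{n} \lambda_l}.
\end{aligned}
```

The columns of $Z$ are the sample coordinates on the first $k$ principal components, and the explained-variance ratio is the percentage typically reported on each axis of a PCA plot.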

2.6. Collection of Clinical Blood Samples

Between October 2018 and October 2021, 100 patients with CRC and 120 healthy participants were recruited from Shanxi Bethune Hospital, Shanxi Academy of Medical Sciences, Tongji Shanxi Hospital, Third Hospital of Shanxi Medical University in Taiyuan city, Shanxi province, and their clinical information and serum samples were collected (Supplementary Table 1). None of the CRC patients had received any treatment, and their cancer stages were determined according to the American Joint Committee on Cancer (AJCC) Cancer Staging Manual (7th Edition) [28]. Peripheral blood (5 ml) was collected from each participant in 5 ml clotting tubes (Greiner Bio-One, Austria). Serum was separated by centrifugation and stored at -80°C for subsequent miRNA extraction.

This research was approved by the Ethics Committee of Shanxi Bethune Hospital, Shanxi Academy of Medical Sciences, Tongji Shanxi Hospital, Third Hospital of Shanxi Medical University. All participants were fully informed about the study and signed written informed consent.

2.7. Exosome Separation

Exosomes were separated following the steps described by Han et al. [29] and resuspended in phosphate-buffered saline (PBS). The suspension was placed on a copper grid coated with 0.125% Formvar in chloroform and negatively stained with uranyl acetate. Exosome morphology was examined by transmission electron microscopy (TEM).

2.8. RNA Extraction and qRT-PCR

Total RNA was extracted from the obtained exosomes using the miRNeasy Micro Kit (QIAGEN, Germany), and RNA quantity and quality were assessed with an Agilent Bioanalyzer 2100 (Agilent, USA). cDNA was synthesized from total RNA by reverse transcription using the SuperScript III Reverse Transcriptase kit (Invitrogen, USA), and qPCR was performed using SYBR Premix Ex Taq II (Takara, Japan) on an ABI 7500 system (ABI, USA). The relative expression of all miRNAs was calculated using the 2^(-ΔΔCT) method, with U6 as the internal reference. Table 1 shows the primer sequences for the feature miRNAs.
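As a worked illustration of the 2^(-ΔΔCT) calculation, the sketch below normalizes a target miRNA to U6 in both a sample and a control and converts the difference into a fold change. The function name and all Ct values are hypothetical and are not data from this study.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Fold change of a target miRNA by the 2^(-ddCt) method."""
    # Normalise the target to the internal reference (U6) in each group.
    delta_ct_sample = ct_target_sample - ct_ref_sample
    delta_ct_control = ct_target_control - ct_ref_control
    # Difference between sample and control after normalisation.
    delta_delta_ct = delta_ct_sample - delta_ct_control
    return 2 ** (-delta_delta_ct)


# Hypothetical Ct values: target miRNA and U6 in a CRC sample and a normal control.
fold_change = relative_expression(24.0, 18.0, 26.0, 18.0)
print(fold_change)  # 4.0
```

In this example the target crosses the threshold two cycles earlier in the CRC sample than in the control after normalization, which corresponds to a fourfold higher relative expression.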

2.9. Statistical Analysis

Box plots were drawn with GraphPad 8.0. Differences in the relative expression of miRNAs between tumor and normal samples were analyzed using the -test, and indicated a difference that was statistically significant.

3. Results

3.1. Constructing the Diagnostic Model of CRC

Normalization and differential analysis of the miRNA data from CRC and normal exosomes yielded 56 differentially expressed miRNAs (DEmiRNAs). Subsequent XGBoost feature selection identified the top five miRNAs with the best ability to distinguish sample types. To determine the optimal diagnostic classifier for CRC, we constructed different MLP classifiers and plotted IFS curves to visually select miRNA combinations. The IFS curve showed that the MLP classifier built on the first four miRNAs (miR-654-5p, miR-126, miR-10b, and miR-144) classified the samples well, with a high MCC under 10-fold cross-validation (Figure 1). This model had a sensitivity of 0.977, a specificity of 1.000, an accuracy of 0.980, and an MCC of 0.909.
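The four reported figures can be checked against one another. The sketch below assumes that the cross-validated predictions were pooled over all 99 GSE39833 samples and that 86 of 88 CRC samples and all 11 normal samples were classified correctly. These counts are an inference from the reported metrics, not a confusion matrix given in the paper, and the function name is hypothetical.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy, and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, accuracy, mcc


# Assumed counts: 86 of 88 CRC samples and all 11 normal samples correct.
sens, spec, acc, mcc = binary_metrics(tp=86, fn=2, tn=11, fp=0)
print(round(sens, 3), round(spec, 3), round(acc, 3), round(mcc, 3))
# 0.977 1.0 0.98 0.909
```

Under these assumed counts, all four values match the reported sensitivity, specificity, accuracy, and MCC after rounding.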

3.2. Validation of the Performance of the Diagnostic Model

The expression data of the four miRNAs in the MLP classifier were subjected to PCA for the CRC and normal samples. As shown in Figure 2, PCA clearly separated CRC from normal samples, with Dim1 contributing 41.6% and Dim2 contributing 32.2% (73.8% in total). The violin plots showed that the levels of blood exosomal miR-654-5p, miR-126, and miR-10b were markedly higher in CRC patients than in normal participants, whereas miR-144 was markedly lower (Figures 3(a)-3(d)). These results indicate that the MLP classifier built on the first four miRNAs has value in assisting CRC diagnosis.

3.3. qRT-PCR of miRNAs from Clinical Samples and Receiver Operator Characteristic (ROC) Analysis

To validate the performance of this model in clinical CRC diagnosis, we recruited 100 CRC patients and 120 healthy participants (Table 2), collected their blood samples, and extracted exosomes for qRT-PCR. The isolated exosomes were first validated for size and morphology. Under TEM, the extracted exosomes appeared as oval membrane-bound vesicles about 50-150 nm in diameter (Figure 4(a)). qRT-PCR then revealed that the levels of blood exosomal miR-654-5p, miR-126, and miR-10b were markedly higher in CRC patients than in normal participants (Figures 4(b)-4(d)), whereas miR-144 was markedly lower (Figure 4(e)). The qRT-PCR data were then used to validate the performance of the diagnostic model. The area under the ROC curve (AUC) of the 4-miRNA diagnostic model was 0.913 (Figure 4(f)), with a recall of 0.91, a specificity of 0.34, an accuracy of 0.6, and an F1 score of 0.67. Collectively, qRT-PCR on clinical samples validated that this 4-miRNA model could distinguish CRC from normal samples, supporting these miRNAs as biomarkers for CRC diagnosis.
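Assuming the value of 0.913 refers to the area under the ROC curve, it can be read as the probability that a randomly chosen CRC sample receives a higher model score than a randomly chosen normal sample. The sketch below computes that quantity by direct pairwise comparison; the function name and the scores are hypothetical and are not data from this study.

```python
def roc_auc(positive_scores, negative_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical classifier scores for CRC (positive) and normal (negative) samples.
crc_scores = [0.9, 0.8, 0.7, 0.4]
normal_scores = [0.6, 0.3, 0.2]
auc = roc_auc(crc_scores, normal_scores)
print(round(auc, 3))
```

Unlike recall and specificity, this quantity does not depend on a decision threshold, so a model can have a high AUC while a particular threshold still yields low specificity.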

4. Discussion

miRNAs are key regulators of a variety of biological and physiological processes, and their dysregulation may be tightly linked to changes in the pathological environment of disease [30-32]. Colonoscopy is the gold standard for the pathological diagnosis of CRC, but its high invasiveness places a large physical and psychological burden on patients [5-8]. Because patients avoid colonoscopy, CRC often cannot be diagnosed promptly at an early stage and is only diagnosed at advanced stages, when the tumor has metastasized to other tissues [33]. The advantage of miRNA detection over invasive colonoscopy is that samples such as blood and other body fluids are more readily accessible in clinical practice. At the same time, this noninvasive examination greatly alleviates the physical burden on patients [32, 34, 35]. Given their noninvasive and easily accessible nature, miRNAs are promising biomarkers for CRC diagnosis.

Here, we used XGBoost to determine the key features by ranking feature importance and recursive elimination. We identified the top five miRNAs that could accurately distinguish CRC patients from healthy individuals and subsequently found, via the IFS method, that the MLP classifier built on the top four miRNAs was the best for CRC diagnosis. MLP is a neural network-based classifier that can directly determine the separating hyperplanes between two types of events, with high classification accuracy and a strong capacity for parallel distributed processing. Other studies have also constructed CRC diagnostic classifiers based on machine learning algorithms. Koppad et al. [21] screened CRC diagnosis-related genes with the random forest algorithm, which has the advantages of avoiding overfitting and reducing the computational load of the model. Our aim was to select biomarkers that could diagnose cancer, and because MLP classifies two types of events, we used it as our tool for identifying miRNAs that could assist CRC diagnosis.

The top four miRNAs selected by IFS (miR-654-5p, miR-126, miR-10b, and miR-144) could accurately diagnose CRC, and all four have been reported in CRC. By analyzing miR-654-5p levels in tissue from CRC patients and normal participants, Li et al. [36] reported that a decreased miR-654-5p level is markedly correlated with the clinical stage of colon cancer, indicating that its level might be closely related to CRC progression. Ebrahimi et al. [37] stated that a low miR-126 level in CRC is linked to CRC histological subtype, perineural tumor invasion, microsatellite instability pathological analysis, and lymph node distal metastasis. One study indicated that miR-10b is upregulated in CRC patients with liver metastases, is positively linked to advanced TNM stage, and can predict advanced clinicopathological features and liver metastasis in CRC [38]. Choi et al. [39] indicated that stool from CRC patients is a novel screening biomarker and that the miR-144 level in stool has good sensitivity and specificity for CRC detection. Finally, we collected blood samples from CRC patients and normal participants for qRT-PCR, and the expression trends of the miRNAs were consistent with those reported in the literature, which also supported the accuracy of our study. Further, PCA revealed that the MLP diagnostic classifier built on miR-654-5p, miR-126, miR-10b, and miR-144 could clearly distinguish samples from CRC patients and normal individuals. Hence, these four miRNAs could be unique biomarkers for the noninvasive examination of CRC.

However, limitations remain. Our study used a limited number of public datasets and did not take into account factors such as age, gender, ethnicity, and tumor TNM stage, which may affect miRNA expression. A more precise diagnostic model could therefore be built through a more detailed analysis of these factors, providing science-based evidence for the clinical noninvasive diagnosis of CRC. Overall, we applied XGBoost and constructed an MLP classifier to identify the four miRNAs with the highest diagnostic value. PCA and ROC curves suggested favorable performance of the 4-miRNA classifier in distinguishing CRC patients from normal individuals. This study provides a science-based theoretical foundation for the noninvasive diagnosis of CRC.

Data Availability

The data and materials in the current study are available from the corresponding author on reasonable request.

Ethical Approval

This research is approved by the Ethics Committee of Shanxi Bethune Hospital.

All participants were well-informed about the necessary information of this study and signed the written informed consent.


The funders did not participate in designing, performing, or reporting in the current study.

Conflicts of Interest

The authors declare no conflicts of interest.

Authors’ Contributions

GD and CR conceived and designed the study. GD, CR, and JW performed the experiments. JW provided the mutants. GD and JM wrote the paper. GD, CR, and JW reviewed and edited the manuscript. All authors read and approved the manuscript.


This study was supported in part by grants from the Health Commission of Shanxi Province (No. 2021150) and Shanxi Province “136 Revitalization Medical Project Construction Funds”.

Supplementary Materials

Supplementary Table 1: clinical information of 220 participants.