Abstract

In big data analysis, with the rapid improvement of computer storage capacity and the rapid development of complex algorithms, the exponential growth of massive data has driven rapid progress in science and technology. Based on omics data such as mRNA expression, microRNA expression, and DNA methylation data, this study uses traditional clustering methods such as k-means, K-nearest neighbors, hierarchical clustering, affinity propagation, and nonnegative matrix factorization to classify samples into subtypes, with the following findings: (1) The assumption that the attributes are mutually independent reduces the classification performance of the algorithm to a certain extent. Following the idea of the multilevel grid, the mapping from the high-dimensional space to the one-dimensional space is one-to-one, and encoding the one-dimensional cells of the hierarchical grid greatly simplifies the complexity. The logic of the algorithm is relatively simple, and its classification efficiency is very stable. (2) Converting the two-dimensional representation of the data into a one-dimensional binary representation achieves dimensionality reduction and improves the efficiency of data organization and storage. The grid code expresses the spatial position of the data and preserves the original organization of the data without abstracting the data objects. (3) Processing nondiscretized data containing missing values provides a new opportunity for identifying the protein targets of small-molecule therapies and yields a better classification effect. (4) The comparison of the three models shows that Naive Bayes is the optimal model. Each iteration consists of alternating expectation (E) and maximization (M) steps, and the resulting proteins are then identified and quantified by mass spectrometry (MS).

1. Introduction

Next-generation sequencing (NGS), also known as high-throughput sequencing or massively parallel sequencing, is a technology that can sequence thousands to billions of DNA fragments simultaneously and independently. The dideoxynucleotide chain-termination sequencing method originated in the 1970s. After continuous refinement, the Sanger method triggered a sequencing boom and became the mainstream because of its simplicity and speed. To meet increasingly complex research needs, next-generation sequencing technologies have since emerged [1-3]. When NGS was used to detect a variety of cancers and the results were compared with the Sanger method, it was found that, in addition to common gene mutations, NGS can also detect many gene mutations that are missed by real-time quantitative PCR. These mutations may play a prompting and guiding role in the occurrence and development of cancer and in the diagnosis and treatment of patients, which also reflects the value of NGS technology in clinical work. There are currently three mainstream NGS platforms: the Roche 454, Ion Torrent, and Illumina platforms. The Roche 454 platform is based on pyrosequencing; that is, bases are incorporated in the order T, A, C, and G during sequencing, and pyrophosphate is released after pairing. The Ion Torrent platform uses semiconductor sequencing technology: an ion sensor detects the pH change caused by proton release during synthesis and thereby determines the base sequence. NGS detection programs have different emphases. Whole genome sequencing (WGS) can detect all genetic changes and provide a comprehensive analysis of tumor-related genes, but it is costly and time-consuming. Whole exome sequencing (WES), which detects only the coding regions, is more economical; it can detect known mutant coding genes and discover new gene mutations in cancer. Whole transcriptome sequencing, based on cDNA sequencing, can capture information about overall transcriptional activity [4]. Targeted sequencing can select a subset of genes required for disease research for higher sequencing efficiency, but it is not suitable for detecting unknown mutations [5-7]. The techniques of experimental manipulation (wet experiments) and bioinformatics analysis (dry experiments) have developed continuously. NGS technology is widely used in solid tumors, where many new gene mutations have been discovered, providing new ideas for detecting genetic susceptibility and guiding individualized precision medicine, and it has played an extremely important role in studying the mutational pathways of human malignant tumors. Liver cancer is one of the most common cancers today [8-10]. According to the 2020 report of the American Cancer Society, an estimated 42,810 new cases of and 30,160 deaths from liver and intrahepatic bile duct cancer occurred in the United States that year. Statistics from China and abroad show that liver cancer is an important cause of cancer death worldwide, and its treatment is an urgent problem to be solved. Liver cancer is divided into two types, primary and secondary, of which primary liver cancer (PLC) is the most common.
From a histological point of view, primary liver cancer can be divided into subtypes according to the cell of origin: hepatocellular carcinoma (HCC) (about 75-85% of all cases), intrahepatic cholangiocarcinoma (ICC) (about 10-15%), and other rare forms. Hepatocellular carcinoma has therefore become the main focus of liver cancer research. The main known causes of liver cancer are hepatotropic viruses, chiefly chronic infection with hepatitis B virus (HBV) and hepatitis C virus (HCV); chemical stimulation, such as alcohol abuse and aflatoxin; metabolic abnormalities, such as diabetes, nonalcoholic fatty liver disease, and hereditary hemochromatosis; and immune-related causes, such as cirrhosis-associated immune dysfunction syndrome (CAID) and autoimmune hepatitis. Among them, viral infection is the main factor causing liver cancer [11-13]. Hepatocellular carcinoma cells show extensive heterogeneity, ranging from early lesions caused by a small number of mutations to the eventual advanced forms of the disease. Because the factors that induce liver cancer are diverse and their distribution differs across countries and regions, the molecular mechanism of liver cancer is complicated. Broadly, liver cancer is divided into two categories: proliferative and nonproliferative. The proliferative type is common in HBV-induced liver cancer, with a low degree of differentiation, high alpha-fetoprotein (AFP) expression, more vascular invasion, and a worse prognosis; this type is characterized by more frequent inactivating mutations in TP53 and AXIN1, together with activation of the prosurvival cell cycle, mTOR, RAS-MAPK, and MET signaling pathways. The nonproliferative class is commonly seen in HCV- and alcohol-related hepatocellular carcinoma, with moderate or high differentiation, low AFP expression, less aggressiveness, and chromosomal stability; it is characterized by greater heterogeneity, a higher frequency of CTNNB1 (β-catenin) activating mutations and TERT promoter mutations, and activation of the WNT and IL6/JAK-STAT signaling pathways. However, these commonly mutated genes in liver cancer, TP53, AXIN1, CTNNB1, and TERT, have proved difficult to target [14, 15]. At present, liver resection and liver transplantation are the main treatments for patients with early-stage liver cancer, while patients with intermediate-stage liver cancer are often treated with hepatic arterial chemoembolization and radioembolization, which can greatly prolong survival. However, because of the lack of specific symptoms and tumor biomarkers, most HCC patients are diagnosed at an advanced stage, when these curative treatments are no longer suitable. Sorafenib, a multitargeted receptor tyrosine kinase inhibitor, was identified as a therapeutic drug with survival benefit for patients with advanced liver cancer. Multiple drugs have since shown clinical efficacy, including other RTK inhibitors such as lenvatinib, regorafenib, and cabozantinib. The liver is an important organ that removes toxins and regulates blood sugar, fat, and amino acid uptake. As in all cancers, the gradual accumulation of genetic and epigenetic changes in the liver, accompanied by extensive metabolic changes, leads to abnormal proliferation of mature hepatocytes and the evolution of liver cancer.
Because liver cancer cells are highly heterogeneous and the pathogenic factors involve various signaling pathways, clinical experience shows that applying a single unified treatment regimen to all patients may yield different curative effects and may even exacerbate symptoms. Therefore, “personalized medicine” is the development direction of contemporary liver cancer treatment, and different therapeutic methods based on molecular and cell therapy have been developed. Emerging molecular-level therapeutic strategies include molecular targeted therapy, targeted radionuclide therapy, and epigenetic modification-based therapy, which provide new strategies for the treatment of liver cancer.

2. Big Data Analysis of Liver Cancer

2.1. Disease Diagnosis of Omics Big Data

Cancer subtype classification methods based on omics data mainly include methods based on a single type of omics data and methods based on multiomics data fusion. The former take one type of omics data, such as mRNA expression, microRNA expression, or DNA methylation data, and apply traditional clustering methods such as k-means, K-nearest neighbors, hierarchical clustering, affinity propagation, and nonnegative matrix factorization to classify the samples, yielding the cancer subtype classification results shown in Figure 1. With the development of related technologies, the collection of omics data has grown explosively, and the cost of acquiring them has fallen greatly. Databases led by TCGA provide a large amount of genomic, transcriptomic, proteomic, and other data for different cancer patients. Since different omics data describe the complex life processes in cancer cells and the interactions between molecules from different perspectives, the information they carry is complementary, and integrated analysis of multiomics data can identify more accurate and reasonable subtype results. In recent years, research has therefore focused on integration methods for multiomics data. Most existing integrative analysis methods must address problems inherent to biological data: the sample size is small while the dimensionality is high (the so-called curse of dimensionality), and when the data ranges and data types are inconsistent, the omics-specific and between-omics data structures underlying multiomics data are easily overlooked. Classified by the data they support, existing integrative analysis methods include general methods that can analyze any combination of omics data and specialized methods designed for specific data types. The former can be applied to any multiomics data and easily extended to additional omics layers, while the latter rely on known biological relationships (such as the association between copy number changes and gene expression profiles) and can only analyze specific data types.
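As an illustration of the single-omics route, the following sketch clusters a synthetic expression matrix with three of the methods named above (k-means, hierarchical clustering, and NMF). The matrix `X`, the number of subtypes `k`, and all parameters are illustrative assumptions, not values from this study.

```python
# Minimal single-omics subtype clustering sketch (synthetic data; all
# parameter choices are assumptions for illustration only).
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((100, 2000))        # 100 samples x 2000 genes (stand-in for mRNA data)
k = 4                              # assumed number of cancer subtypes

# k-means directly on the expression profiles
kmeans_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# hierarchical clustering (Ward linkage by default)
hier_labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)

# NMF: factor X ~ W H, then assign each sample to its dominant latent factor
nmf = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X)           # samples x k coefficient matrix
nmf_labels = W.argmax(axis=1)
```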

2.2. Epigenetics of Liver Cancer

Abnormal epigenetic changes are important drivers of the occurrence, development, and metastasis of liver cancer. Epigenetics, the heritable modification of gene function without alteration of the DNA sequence, is caused by many different factors, and epigenetic alterations are frequently present in liver cancer. The screening process is shown in Figure 2. Epigenetic processes include, but are not limited to, chromatin remodeling, histone modification, DNA methylation, and the expression of noncoding RNAs. Unlike genomic alterations, which are irreversible, epigenetic changes are reversible, opening a promising avenue for the development of new therapeutic modalities. Therefore, epigenetic changes associated with cancer in general and liver cancer in particular are increasingly used in biomarker development. Hepatocellular carcinoma (HCC) is one of the most common liver tumors and has become the leading cause of cancer-related death in many regions and countries. Although many measures have been taken for the prevention, early screening, diagnosis, and treatment of liver cancer, the current situation of liver cancer in China is still not optimistic.

3. Algorithm Model

3.1. Naive Bayes [16-20]

The model minimizes a matrix reconstruction term plus a group sparse penalty,

$$\min_{W,\,H \ge 0} \; \mathcal{R}(X, WH) + \lambda\, \Omega(H),$$

where $\mathcal{R}(\cdot)$ is the matrix reconstruction function, $\Omega(\cdot)$ is the group sparse constraint function, and $\lambda$ is the weight of the group sparse constraint term. Here $k$ denotes the number of clusters and $C_j$ the index set of the samples belonging to the $j$-th category.
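A minimal numerical sketch of this reconstruction-plus-sparsity objective, assuming a squared Frobenius reconstruction term and a pure $\ell_1$ penalty on $H$ (the data matrix and every parameter below are illustrative, not from the paper):

```python
# Sketch: minimize ||X - W H||_F^2 + lambda * ||H||_1 over W, H >= 0.
# sklearn's NMF applies the penalty via alpha_H with l1_ratio=1.0 (pure L1).
import numpy as np
from sklearn.decomposition import NMF

X = np.abs(np.random.default_rng(1).normal(size=(100, 500)))  # synthetic, nonnegative
model = NMF(n_components=4, alpha_H=0.1, l1_ratio=1.0,        # lambda ~ alpha_H
            init="nndsvda", max_iter=500, random_state=1)
W = model.fit_transform(X)
H = model.components_

# Unscaled value of the objective for inspection
loss = np.linalg.norm(X - W @ H) ** 2 + 0.1 * np.abs(H).sum()
```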

3.2. Okumura-Hata [21-23]

The input data enter the model after the standard targeted-sequencing workflow: capture probe hybridization, assembly and splicing, detection of the coding gene regions, quality control, and data filtering.
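The section does not state which form of the Okumura-Hata model is used; for reference, the textbook urban-area formulation of the median path loss is sketched below, with the standard parameter ranges.

```latex
% Textbook urban-area Okumura-Hata median path loss in dB (the paper does not
% specify its variant; this is the standard form). Validity ranges:
% f in MHz (150-1500), h_B base-station antenna height in m (30-200),
% h_M mobile antenna height in m (1-10), d distance in km (1-10).
L_{50} = 69.55 + 26.16\log_{10} f - 13.82\log_{10} h_B - a(h_M)
       + \bigl(44.9 - 6.55\log_{10} h_B\bigr)\log_{10} d,
\qquad
a(h_M) = \bigl(1.1\log_{10} f - 0.7\bigr)h_M - \bigl(1.56\log_{10} f - 0.8\bigr)
% a(h_M) above is the mobile-antenna correction for a small or medium city.
```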

3.3. AdaBoost [24-27]

For multiomics data fusion and subtype classification, each omics matrix $X^{(v)}$ is factorized as $X^{(v)} \approx W^{(v)} H$, where $H$ is a shared implicit (latent) expression matrix and the $W^{(v)}$ are modality-specific basis matrices. Clustering the columns of $H$ produces the category division, that is, the subtype classification results of the omics data.
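A compact sketch of such a shared-factor joint NMF, with two synthetic omics layers and alternating multiplicative updates; all dimensions, update rules, and the iteration count are assumptions for illustration.

```python
# Joint NMF sketch: X_v ~ W_v H with one H shared across modalities
# (samples are shared, features differ per omics layer).
import numpy as np

rng = np.random.default_rng(2)
n, k = 100, 4                        # samples, latent factors (assumed)
Xs = [np.abs(rng.normal(size=(d, n))) for d in (500, 300)]   # two omics layers

Ws = [rng.random((X.shape[0], k)) for X in Xs]               # modality-specific bases
H = rng.random((k, n))               # shared latent expression matrix
eps = 1e-9

for _ in range(200):
    # update each modality-specific basis matrix W_v
    for v, X in enumerate(Xs):
        Ws[v] *= (X @ H.T) / (Ws[v] @ H @ H.T + eps)
    # update the shared H using all modalities jointly
    num = sum(W.T @ X for W, X in zip(Ws, Xs))
    den = sum(W.T @ W @ H for W in Ws) + eps
    H *= num / den

labels = H.argmax(axis=0)            # subtype assignment per sample
```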

4. Simulation Experiment

4.1. Big Data Analysis of Liver Cancer Sequencing Results

The coding of the multilevel grid adopts a simple construction that assumes the attributes of a given target value are conditionally independent of each other. It realizes space-filling for the dimensionality-reduction map and then learns the joint distribution from input to output from the training data set. During the calculation, positions are represented by the binary values 0 and 1, and the feature data set of unknown category is input to obtain the output category vector that maximizes the posterior probability. The same indexing scheme is commonly used in Geohash encoding algorithms. The results are shown in Table 1 and Figures 3 and 4. The assumption that the attributes are mutually independent reduces the classification effect of the algorithm to a certain extent. Following the idea of the multilevel grid, the mapping from the high-dimensional space to the one-dimensional space is one-to-one, and encoding the one-dimensional cells of the hierarchical grid greatly simplifies the complexity. The logic of the algorithm is relatively simple, and its classification efficiency is very stable. The grid division and coding rules are all derived from the grid definition: the calculation process repeatedly bisects the longitude and latitude ranges of the grid, and the method is not sensitive to data sets containing missing values.
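A Geohash-style sketch of this bisection encoding is shown below: longitude and latitude ranges are alternately halved, one bit is emitted per split, and a two-dimensional position maps to a single one-dimensional binary code. The bit depth and coordinate ranges are the usual Geohash conventions, not values from the paper.

```python
# Multilevel-grid (Geohash-style) encoding sketch: alternately bisect the
# longitude and latitude intervals, emitting one bit per split.
def grid_encode(lon, lat, bits=32):
    intervals = {"lon": [-180.0, 180.0], "lat": [-90.0, 90.0]}
    code = 0
    for i in range(bits):
        axis, value = ("lon", lon) if i % 2 == 0 else ("lat", lat)
        lo, hi = intervals[axis]
        mid = (lo + hi) / 2
        bit = 1 if value >= mid else 0          # 1 = upper half, 0 = lower half
        code = (code << 1) | bit
        intervals[axis] = [mid, hi] if bit else [lo, mid]
    return code

# Example: nearby points share long common binary prefixes.
print(bin(grid_encode(116.40, 39.90)))
print(bin(grid_encode(116.41, 39.91)))
```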

4.2. Encoding Process of Liver Cancer Data

Big data has three implicit connotations: space, time, and semantics. As shown in Table 2 and Figure 5, the actual spatial position of the data is converted to a position in the global multilevel grid, so that the two-dimensional representation of the data becomes a one-dimensional binary representation. This realizes dimensionality reduction and improves the efficiency of data organization and storage. The grid code expresses the spatial position of the data and maintains the original organization of the data without abstracting the data objects. Instead, the expression is converted once more into grid-code identification: according to the actual area covered by an object, the regional characteristics of the data are expressed as grid units, and the final grid code is composed of the codes of those grid units.
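The inverse step, recovering a grid cell from its code, can be sketched as the mirror image of `grid_encode` above (same assumed ranges and bit order):

```python
# Decode a grid code back to the bounding box of its grid cell
# (inverse of the grid_encode sketch above; ranges are assumptions).
def grid_decode(code, bits=32):
    intervals = {"lon": [-180.0, 180.0], "lat": [-90.0, 90.0]}
    for i in range(bits):
        axis = "lon" if i % 2 == 0 else "lat"
        bit = (code >> (bits - 1 - i)) & 1      # read bits most-significant first
        lo, hi = intervals[axis]
        mid = (lo + hi) / 2
        intervals[axis] = [mid, hi] if bit else [lo, mid]
    return intervals                            # {"lon": [lo, hi], "lat": [lo, hi]}
```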

4.3. Naive Bayes Algorithm Training on Sequencing Data

The Naive Bayes algorithm is a classification algorithm; in the genome-screening setting considered here, the human genome and many model-organism genomes have already been sequenced. As shown in Table 3 and Figure 6, for data that are not discretized and contain missing values, the method provides new opportunities for identifying the protein targets of small-molecule therapeutics and achieves a better classification effect. New chemical genomics and genomics approaches link small molecules to their protein targets, and chemical proteomic methods may further facilitate target identification. The algorithm suits scenarios such as the efficient classification of multidimensional feature data, although it favors features with more attribute values, which affects the construction of the decision tree and the final classification effect. Drug affinity chromatography combined with mass spectrometry and computational analysis is used to classify small molecule-protein interactions across the whole proteome; in the compound-centric chemical proteomics method, the molecules are immobilized on a substrate to maintain their activity and improve the accuracy of the algorithm.
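One way to handle continuous, non-discretized features with missing values before Naive Bayes is simple imputation followed by a Gaussian likelihood model; the sketch below uses synthetic data, and the imputation strategy and missingness rate are assumptions, not the paper's protocol.

```python
# Gaussian Naive Bayes on continuous features with missing values
# (mean imputation in a pipeline; all data and parameters are synthetic).
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 20))
X[rng.random(X.shape) < 0.05] = np.nan      # inject ~5% missing values
y = rng.integers(0, 2, size=200)

clf = make_pipeline(SimpleImputer(strategy="mean"), GaussianNB())
clf.fit(X, y)
posterior = clf.predict_proba(X[:5])        # class posteriors for 5 samples
```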

4.4. Iterative Optimization

The expectation-maximization (EM) algorithm is an iterative optimization algorithm; in the proteomics workflow, the cell lysate of interest is first incubated with an affinity matrix. While the algorithm searches for maximum likelihood estimates of the parameters, the eluted proteins are processed gel-free. As shown in Table 4 and Figure 7, the comparison of the three models shows that Naive Bayes is the best model. Each iteration consists of alternating expectation (E) and maximization (M) steps, after which the proteins are identified and quantified by MS. An advantage of chemical proteomics is the ability to probe the entire proteome, with the iteration continuing until convergence. In the Okumura-Hata model, small molecules that encounter and interact with these proteins in their natural state and environment serve as a data-augmentation mechanism. Another advantage is that proteomics can be tested in any cell type or tissue of interest, and the EM procedure guarantees that the likelihood of the parameter estimates rises steadily over the iterations. The AdaBoost model performs worst. Iterative optimization can analyze the trend of the information-gain rate across a large number of classification results; therefore, the decision tree model is constructed according to the selected attributes, which is the core key to solving complex problems. The top-down recursive solution is accurate and complete, and the rules mapping attribute values to categories form a series of clear instructions for solving the problem.
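The alternation of E and M steps until convergence can be made concrete with a small worked example: EM for a one-dimensional two-component Gaussian mixture on synthetic data. The initialization, tolerance, and iteration cap below are assumptions for illustration.

```python
# EM sketch for a 1-D two-component Gaussian mixture (synthetic data).
# Each iteration: E-step computes posterior responsibilities, M-step
# re-estimates the parameters; the log-likelihood never decreases.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 200)])

pi, mu, sigma = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])
prev_ll = -np.inf
for _ in range(200):
    # E-step: responsibility of component 1 for each point
    p1 = pi * norm.pdf(x, mu[1], sigma[1])
    p0 = (1 - pi) * norm.pdf(x, mu[0], sigma[0])
    r = p1 / (p0 + p1)
    # M-step: maximize the expected complete-data log-likelihood
    pi = r.mean()
    mu = np.array([np.average(x, weights=1 - r), np.average(x, weights=r)])
    sigma = np.sqrt(np.array([np.average((x - mu[0])**2, weights=1 - r),
                              np.average((x - mu[1])**2, weights=r)]))
    ll = np.log(p0 + p1).sum()       # likelihood at the start of this iteration
    if ll - prev_ll < 1e-8:          # stop once the improvement is negligible
        break
    prev_ll = ll
```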

5. Conclusion

In big data analysis, with the rapid improvement of computer storage capacity and the rapid development of complex algorithms, the exponential growth of massive data has driven rapid progress in science and technology. Based on omics data such as mRNA expression, microRNA expression, and DNA methylation data, this study used traditional clustering methods such as k-means, K-nearest neighbors, hierarchical clustering, affinity propagation, and nonnegative matrix factorization to classify samples into categories and obtained the following results: (1) The assumption that the attributes are mutually independent reduces the classification effect of the algorithm to a certain extent. Following the idea of the multilevel grid, the mapping from the high-dimensional space to the one-dimensional space is one-to-one, and encoding the one-dimensional cells of the hierarchical grid greatly simplifies the complexity. The logic of the algorithm is relatively simple, and its classification efficiency is very stable. (2) Converting the two-dimensional representation of the data into a one-dimensional binary representation realizes dimensionality reduction and improves the efficiency of data organization and storage. The grid code expresses the spatial position of the data and maintains the original organization of the data without abstracting the data objects. (3) For data that are not discretized and contain missing values, the method provides a new opportunity for identifying the protein targets of small-molecule therapy and obtains a better classification effect. Chemical proteomics methods may also facilitate the identification of protein targets for multidimensional feature data analysis, although the algorithm favors feature data with more attribute values. (4) The comparison of the three models shows that Naive Bayes is the optimal model. Each iteration consists of alternating expectation and maximization steps, and the proteins are then identified and quantified by MS. An advantage of chemical proteomics is the ability to probe the entire proteome, with the iteration continuing until convergence.

Data Availability

The experimental data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding this work.

Authors’ Contributions

Chaohui Xiao and Fuchuan Wang contributed equally to this work.