Special Issue: Scientific Programming for Multimodal Big Data
Research Based on Multimodal Deep Feature Fusion for the Auxiliary Diagnosis Model of Infectious Respiratory Diseases
Pulmonary infection is a common infectious disease of the respiratory tract with a high incidence and a mortality rate as high as 30%–50%, which seriously threatens human life and health. Accurate and timely anti-infective treatment is the key to improving the cure rate. Next-generation sequencing (NGS) provides a new, fast, and accurate method for pathogen diagnosis and can give the clinic effective clues, but determining the true causative pathogen remains an urgent problem: the clinician must make a comprehensive judgment that combines laboratory results, clinical information, and epidemiology. This paper aims to collect NGS data, clinical manifestations, laboratory test results, imaging results, and other multimodal data of patients with infectious respiratory diseases and to handle the missing values in these data. It also studies a deep feature fusion algorithm for multimodal data that couples the private and shared features of the different modalities and mines their hidden information to obtain efficient and robust shared features that support auxiliary diagnosis. An auxiliary diagnosis model built on these features can make the diagnosis of infectious respiratory diseases more intelligent and automated, which is of significant value in clinical practice.
With the advent of the big data era, data has flooded all aspects of society. For modern medicine, the human body has become a large database, and the variety of medical data gives modern medicine a distinctly data-driven character. Data, especially medical data, need to be examined critically. With the combination of big data analysis technology and biomedicine, various computational modeling methods (pattern recognition, data mining, machine learning, deep learning, etc.) have been applied to the field of medicine. On this basis, we design a high-throughput pathogen detection system that couples the sequencing platform with an artificial intelligence high-performance computing platform. The system establishes a high-order tensor database for infectious respiratory diseases and a multimodal database that combines imaging, laboratory examination results, and clinical manifestations; it uses artificial intelligence to explore diseases and unify the treatment of patients, and it provides a treatment query system. This paper studies how NGS data can be combined with clinical and epidemiological data, with the help of deep computation models, in the diagnosis and treatment of infectious respiratory diseases.
Pulmonary infection is a common respiratory infectious disease in clinical practice with a high incidence. In China, it is the leading cause of death in rural areas and the third in urban areas, especially in the form of severe pneumonia. Its incidence has been increasing in recent years, and although treatment methods have progressed greatly, the fatality rate is still as high as 30% to 50%, which seriously threatens human life and health. The rapid and accurate identification of the pathogens that cause respiratory tract infection is the key to treatment: it helps clinicians optimize the use of antibacterial drugs in a timely manner, thereby speeding up recovery, increasing the cure rate, and improving prognosis. At present, the commonly used methods of microbial detection, such as smear, culture, and polymerase chain reaction, cannot effectively meet clinical needs. Metagenomic next-generation sequencing (mNGS, also known as high-throughput sequencing) provides a new, rapid, and accurate method for pathogen diagnosis. Compared with traditional detection of pathogenic microorganisms, mNGS has high sensitivity and yields a large amount of information. It can detect pathogens early, guide the precise selection of antibacterial drugs, reduce the use of antibacterial drugs, reduce patient mortality, and identify new or known pathogens as well as mixed infections.
2. Current Research Status
At present, the cause of 60% of infectious diseases is still unclear. Clinical metagenomics is a detection technology that uses high-throughput sequencing to clarify the classification and function of all microorganisms in a sample without relying on traditional microbial culture [2, 3]. This technology can simultaneously detect bacteria, fungi, viruses, and parasites in the same sample without bias and does not require specific amplification. It is suitable for investigating outbreaks caused by unknown pathogens, infections with negative results from traditional tests, immunodeficient patients, and critically ill infected patients. The pathogen must be identified as soon as possible in special populations: infants and young children, elderly patients or patients with underlying diseases, immunodeficient patients, repeatedly hospitalized patients, patients with repeated negative results from traditional microbial detection and a poor response to treatment, patients with suspected infection by special pathogens, patients with unexplained infectious diseases, and critically ill patients. On the one hand, owing to the complexity of pathogenic microorganisms, traditional opportunistic pathogens may become the main pathogens; on the other hand, pathogenic microorganisms may carry multiple antibiotic resistance genes. In these cases, clinical metagenomics is the best diagnostic option [4, 6–8].
3. Application of NGS in Detection of Pulmonary Pathogen Infection
In the narrow sense, clinical metagenomics refers mainly to shotgun next-generation sequencing. The main process is to break all the DNA in the sample into small fragments, build a library, and sequence it on the sequencer; bioinformatics methods then assemble the sequencing results and compare them against a database to identify the detected species. In the broad sense, clinical metagenomics also includes amplicon-based second-generation sequencing, mainly sequencing of the bacterial 16S ribosomal RNA gene and amplicon sequencing of the fungal internal transcribed spacer (ITS). Here the main process is to extract all the DNA in the sample, perform PCR amplification with primers specific to bacteria or fungi, build a library and sequence it on the sequencer, use bioinformatics methods to obtain qualified sequencing data, and finally compare the data against a database to identify the detected species. It is worth noting that shotgun clinical metagenomics can simultaneously identify bacteria, fungi, viruses, and protozoa in a sample, is accurate to the species level, and can also identify antibiotic resistance and other microbial functions, whereas amplicon sequencing can identify only the bacteria or fungi in the sample, is accurate only to the genus level, and can only infer microbial functions from the database.
Clinical metagenomics is considered the most powerful tool for identifying the pathogens of infectious diseases, but there is not yet a unified path for its clinical application. We combined the working characteristics of clinicians, laboratory technicians, and bioinformatics analysts and summarized an application model of clinical metagenomics for the precision diagnosis and treatment of respiratory infectious diseases. This model requires communication among clinicians, laboratory technicians, and bioinformatics analysts in order to obtain the most effective data and prescribe precise medication.
The samples of patients with respiratory tract infection mainly include sputum, airway aspirate, and alveolar lavage fluid. Moran Losada et al. used clinical metagenomics to analyze induced sputum samples from patients of different ages with cystic fibrosis and confirmed that about 99% of respiratory tract microorganisms are bacteria, comprising hundreds of species and dominated by Pseudomonas aeruginosa and Staphylococcus aureus, while 10 types of fungi and viruses account for only about 1% of the respiratory tract microorganisms. The fungi are mainly Candida and Aspergillus, and the viruses are mainly adenovirus and herpes virus. The study also clarified the microbial abundance in each respiratory sample and confirmed the relevant antibiotic resistance genes of Pseudomonas aeruginosa and Staphylococcus aureus, which provides a basis for the precise selection of antibiotics. Langelier et al. enrolled 22 bone marrow transplant patients admitted to hospital for lower respiratory tract infection and used clinical metagenomics to test 250 µl of alveolar lavage fluid from each patient. The results confirmed the presence, in the lungs of bone marrow transplant patients with acute respiratory infection, of HCoV-229E, HRV-A, HHV-6, CMV, HSV, EBV, human papillomavirus, torque teno virus, and other viruses, as well as rare pathogenic bacteria such as Streptococcus mitis and Corynebacterium propinquum; the clinical symptoms of patients in whom bacteria and viruses coexisted were significantly more severe. In addition, clinical metagenomics has been used to characterize the lung microbes of lung transplant patients with secondary lung infection.
4. Preprocessing of Multimodal Clinical Data of Infectious Respiratory Diseases
There is no unified standard for the scope of data retrieval and database construction for existing cases of infectious respiratory disease. Therefore, through retrospective data sorting and follow-up of historical data, a large number of newly occurring infectious cases and their test results are collected. From these, complete high-throughput genomics data of pathogenic microorganisms and the associated clinical data are derived, the scope of data retrieval is formulated, and the case data are summarized. To address missing and inaccurate values in the aggregated multimodal data, an incomplete-data filling algorithm based on distributed subtractive clustering is studied. The incomplete data are clustered by an improved subtractive clustering algorithm and then filled using the clustering result and a weighted distance. In this way, data with missing attribute values can be filled quickly and accurately in preparation for subsequent tasks such as data mining and analysis:
(1) Collection and collation of case data of pulmonary infectious diseases. Formulate the definition and the inclusion and exclusion criteria for cases of infectious lung disease. According to research needs, following the research plan approved by the unit's ethics review and with the patients' informed consent, collect the case data of pathogenic microorganism genetic testing performed at our hospital's "National Gene Testing Application Demonstration Center" since 2018 and trace the corresponding outpatient or hospitalization information and relevant clinical data.
Retrieve data through the hospital's HIS, LIS, and PACS systems and formulate the scope of data retrieval, including the name of the medical institution, unique ID number, date of onset or medical consultation, basic personal information (gender, date of birth, occupation, etc.), treating department, main symptoms and signs, past history, chief complaint, main diagnosis, imaging examination, and laboratory examination (routine blood test, CRP, PCT, D-dimer, interleukin-6, G/GM test, Aspergillus antibody, Cryptococcus neoformans capsular antibody, tuberculosis antibody, etc.). Download the diagnosis and treatment information according to the established information catalog to form the original CSV database. The case data of the target cases are screened according to the case definition and the inclusion and exclusion criteria to form a statistical table of infectious case data, which is then summarized.
(2) Data filling of cases of lung infectious diseases.
First, the subtractive clustering algorithm is optimized using a similarity measure for incomplete data and the idea of matrix multiplication, and incomplete datasets are clustered directly by a distributed subtractive clustering algorithm built on multilevel MapReduce. Most of the execution time is spent on three steps: dividing the dataset S, calculating the Euclidean distances between sample points, and calculating the density index of the sample points. To reduce the time cost and improve efficiency, a multilevel MapReduce process is used to run these three steps as distributed parallel computations. To make the division of the dataset S suitable for the MapReduce model, the data are first stored by rows so that they can be sliced by rows, with no dependence between slices. In subtractive clustering, calculating the neighborhood radius and the density of sample points requires the distances between samples, so generating the distance matrix between sample points is particularly important. To make the data subset C suitable for MapReduce processing and to generate the distance matrix, this project uses two copies of the data subset C as the operand matrices of a MapReduce implementation of matrix multiplication. When subtractive clustering is applied to the complete data subset C, the density index must be calculated and then revised. From the density index formula, the terms of the density index of data object i correspond exactly to the values in the ith row of the distance matrix G. This property makes the revision of the density index well suited to a parallel MapReduce design.
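The distance-matrix and density-index steps described above can be sketched in a single process as follows. This is an illustrative sketch rather than the authors' implementation: the function names, the parameters `ra`, `rb`, and `stop_ratio`, and the exponential density formula (the commonly used form of subtractive clustering) are assumptions, since the text does not give them. The product of C with its own transpose stands in for the MapReduce matrix multiplication, and each row of the result could be computed by an independent map task.

```python
import math

def gram_matrix(C):
    """Multiply C by its transpose; row i depends only on row i of C and on C."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in C] for ci in C]

def squared_distance_matrix(C):
    """Use ||ci - cj||^2 = <ci,ci> + <cj,cj> - 2<ci,cj> on the Gram matrix."""
    g = gram_matrix(C)
    n = len(C)
    # Clamp at zero to absorb small negative values from floating-point error.
    return [[max(g[i][i] + g[j][j] - 2.0 * g[i][j], 0.0) for j in range(n)]
            for i in range(n)]

def density_index(dist2_row, ra):
    """Density of one object, computed from its own row of the distance matrix."""
    return sum(math.exp(-d2 / (ra / 2.0) ** 2) for d2 in dist2_row)

def subtractive_clustering(C, ra=1.0, rb=None, stop_ratio=0.15):
    """Return the indices of the selected cluster centers (assumed parameters)."""
    rb = 1.5 * ra if rb is None else rb
    G = squared_distance_matrix(C)
    density = [density_index(row, ra) for row in G]
    first_peak = max(density)
    centers = []
    while True:
        c = max(range(len(C)), key=density.__getitem__)
        peak = density[c]
        if peak < stop_ratio * first_peak:
            break
        centers.append(c)
        # Revise every density using only column c of G (equal to row c by symmetry).
        density = [d - peak * math.exp(-G[i][c] / (rb / 2.0) ** 2)
                   for i, d in enumerate(density)]
    return centers
```

Because each density value and each revision touches only one row of G, the same loop bodies map naturally onto row-sliced parallel tasks.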
After the incomplete data have been clustered, the missing values are filled using distance-based weighting coefficients between the incomplete data object and the other data points in the same cluster, which avoids interference from objects in other clusters. The key of this method is to determine the weighting coefficient of each data object. To determine the weighting coefficients objectively and accurately, this article derives them from the distance Dis(Si, Sj) between the data objects Si and Sj, which is computed from m, the number of attributes of a data object, and m′, the number of attributes that are not missing in either of the two objects. Finally, the incomplete data are filled based on the clustering result and the weighted distance.
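As a hedged illustration of this filling step, the sketch below uses the widely used partial-distance form, in which the squared differences over the m′ jointly observed attributes are rescaled by m/m′. The exact formula used in the paper is not reproduced in the text, so this form, the inverse-distance weights, and the names `partial_distance`, `fill_missing`, and `eps` are all assumptions. Missing values are represented as `None`.

```python
import math

def partial_distance(si, sj):
    """Distance over jointly observed attributes, rescaled by m / m'."""
    m = len(si)
    shared = [(a, b) for a, b in zip(si, sj) if a is not None and b is not None]
    if not shared:
        return math.inf  # no common attribute, so no usable distance
    return math.sqrt(m / len(shared) * sum((a - b) ** 2 for a, b in shared))

def fill_missing(target, cluster_members, eps=1e-9):
    """Fill each missing attribute of target from members of the same cluster."""
    filled = list(target)
    for k, value in enumerate(target):
        if value is not None:
            continue
        weighted_sum = weight_total = 0.0
        for other in cluster_members:
            if other[k] is None:
                continue
            d = partial_distance(target, other)
            if math.isinf(d):
                continue
            w = 1.0 / (d + eps)  # closer objects receive larger weights
            weighted_sum += w * other[k]
            weight_total += w
        if weight_total > 0.0:
            filled[k] = weighted_sum / weight_total
    return filled
```

Restricting `cluster_members` to objects from the same cluster is what keeps objects from other clusters from influencing the filled value.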
5. Deep Feature Fusion Learning Model Based on Multimodal Data of Infectious Respiratory Diseases
This paper studies a deep nonnegative correlation feature fusion algorithm for multimodal data. Through the joint learning of unsupervised correlated and uncorrelated features, the influence of modality-private features is removed from the multimodal shared features, which makes the shared space of the correlated fusion features more effective and robust. It also studies a deep transfer feature fusion algorithm for unbalanced multimodal data, which couples the modal deep networks with a modal semantic correlation model and designs a unified deep network architecture based on multilayer semantic matching.
(1) Unsupervised deep nonnegative correlation feature fusion algorithm for multimodal data. Given a multimodal dataset that contains n data instances under V modes, the feature matrix of the vth mode describes the n data instances, each of which is represented as a dv-dimensional feature vector. First, structured sparse projection matrices are used to convert the feature matrix of each mode into a mode-private feature matrix and a mode-shared feature matrix VC. Then, based on invariant graph regularization and sparse projection constraints, a multimodal reconstruction error function is constructed, and its variables are jointly optimized to minimize the reconstruction error through the coupling of shared features. Finally, cluster analysis of the data is carried out on the resulting multimodal shared feature VC.
(2) Deep transfer feature fusion algorithm for unbalanced multimodal data. Based on canonical correlation analysis (CCA), this project intends to construct a multilayer semantic correlation model of cross-modal data. CCA projects different data domains, through suitable matrix transformations, into a feature representation subspace in which the correlation between the domains is maximal.
To implement the model, the data are first encoded by the source-domain and target-domain deep networks, respectively, to learn the hidden-layer feature representations of the source and target domains, where f is the nonlinear activation function of the deep learning network. Canonical correlation analysis is then carried out on the hidden-layer features of the two domains to learn the correlation coefficient matrices of the source and target domains that maximize their correlation.
The features of the first layer are matched into a more similar modal semantic space through the correlation coefficient matrices, and the semantic correlation of the next layer is then carried out. Coupling the modal deep networks with the modal semantic correlation at each layer yields a deep multimodal multilayer semantic matching model, which is defined as minimizing the reconstruction errors of the source and target deep learning networks while maximizing the correlation across the cross-domain deep networks. The reconstruction errors of the source and target deep networks each comprise a cost function and a parameter regularization term.
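For orientation, the textbook form of this construction can be written as below. It is a sketch in assumed notation and should not be read as the paper's own equations: the symbols x_S, x_T, W, b, h, V, Σ, J, θ, the layer index l, and the trade-off weight λ are all introduced here for illustration. The first line gives the hidden-layer encodings, the second the standard CCA objective on those encodings, and the third one plausible way to combine the reconstruction errors with the layer-wise correlation.

```latex
% Hidden-layer encodings of the source (S) and target (T) domains
h_S = f(W_S x_S + b_S), \qquad h_T = f(W_T x_T + b_T)

% Standard CCA objective; \Sigma_{ST} is the cross-covariance of h_S and h_T,
% \Sigma_{SS} and \Sigma_{TT} are the within-domain covariances
(V_S^{*}, V_T^{*}) = \arg\max_{V_S, V_T}
  \frac{V_S^{\top} \Sigma_{ST} V_T}
       {\sqrt{V_S^{\top} \Sigma_{SS} V_S}\,\sqrt{V_T^{\top} \Sigma_{TT} V_T}}

% Assumed combined objective: reconstruction errors minus weighted correlation
\min_{\theta_S, \theta_T, V_S, V_T}\;
  J_S(\theta_S) + J_T(\theta_T)
  - \lambda \sum_{l} \operatorname{corr}\!\left(
      V_S^{(l)\top} h_S^{(l)},\; V_T^{(l)\top} h_T^{(l)} \right)
```

Here J_S and J_T each stand for a cost function plus a parameter regularization term, matching the description of the reconstruction errors in the text.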
6. Establish a Whole-Process Auxiliary Diagnosis and Reasoning Model for Infectious Respiratory Diseases
This paper takes expert experience as the core and uses existing medical dictionaries, electronic medical records, medical guidelines, expert consensus, and other basic data to construct a domain knowledge graph through knowledge extraction and knowledge fusion. The deep feature fusion learning results for the multimodal data of infectious respiratory diseases are then combined with the knowledge graph, and, with reference to the overall diagnosis process currently used for infectious respiratory diseases in the hospital, a whole-process auxiliary diagnosis and reasoning model for infectious respiratory diseases is established.
6.1. Construction of Knowledge Graph of Infectious Respiratory Diseases
The data sources used to construct the knowledge graph can be divided into the following types.
Structured data: structured data extraction is done through the data integrator. The data integrator is divided into three parts: data integration design tools, data integration conversion tools, and data read-write plug-ins. Data integration design tools are used to provide users with graphical design data integration logic functions, data integration conversion tools are used to convert user designs into data integration application codes, and data read-write plug-ins are used to provide data read-write functions for data integration applications.
Semistructured data: semistructured data has a certain implicit structure, but the structure varies greatly and lacks standardization. Two types of semistructured data, encyclopedia websites and industry vertical websites, can usually be used to construct knowledge graphs in vertical domains. These data are all HTML-based Web data, and the web page elements to be extracted can be located through their tags. A web page mainly consists of the entry card at the top, the free-form text in the middle, and the entry label at the bottom. The tag structure of the entry card and the entry label is relatively fixed. The required entity name, entity description, entity attributes, and relationships with other entities can be extracted from the entry card, and the type of the entity can be obtained directly or indirectly from the entry label. For the free-form text in the middle, the required knowledge is extracted with a long short-term memory (LSTM) network.
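Extraction from a fixed-structure entry card can be sketched with Python's standard `html.parser` module. The markup assumed here, a `<dl class="entry-card">` element holding `<dt>`/`<dd>` pairs, and the names `EntryCardParser` and `extract_entry_card` are hypothetical; a real encyclopedia or vertical website would need its own tag names.

```python
from html.parser import HTMLParser

class EntryCardParser(HTMLParser):
    """Collect attribute/value pairs from a hypothetical <dl class="entry-card">."""

    def __init__(self):
        super().__init__()
        self.in_card = False
        self.current_tag = None
        self.pending_key = None
        self.attributes = {}

    def handle_starttag(self, tag, attrs):
        if tag == "dl" and ("class", "entry-card") in attrs:
            self.in_card = True
        elif self.in_card and tag in ("dt", "dd"):
            self.current_tag = tag

    def handle_endtag(self, tag):
        if tag == "dl":
            self.in_card = False
        if tag in ("dt", "dd"):
            self.current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not (self.in_card and text):
            return  # ignore whitespace and anything outside the card
        if self.current_tag == "dt":
            self.pending_key = text
        elif self.current_tag == "dd" and self.pending_key is not None:
            self.attributes[self.pending_key] = text
            self.pending_key = None

def extract_entry_card(html):
    """Return the entry card of one page as a dict of attribute name to value."""
    parser = EntryCardParser()
    parser.feed(html)
    return parser.attributes
```

Text outside the card, such as the free-form body, is deliberately skipped here because the paper handles it with a separate learned extractor.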
Unstructured data: unstructured text is first processed by named entity recognition, which identifies the entities and their categories. The semantic relationships between entities are then extracted from the text by a relation extraction module. For this task, a relation classifier is first trained to determine whether a given predefined relationship holds between two entities in a piece of text. This is essentially a classification problem on sequence data, which is solved with a relation extraction method based on distant supervision.
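The labeling idea behind distant supervision can be sketched as follows: any sentence that mentions both entities of a known knowledge-base triple is treated as a (noisy) training example for that relation. The function name `distant_label`, the triple layout `(head, relation, tail)`, and the `"NA"` label for unrelated pairs are illustrative assumptions, and entity mentions are assumed to have been found already by the named entity recognition step.

```python
def distant_label(sentences, kb_triples, no_relation="NA"):
    """Build training examples by aligning entity pairs with known triples.

    sentences:  list of (text, [entity mentions found in the text])
    kb_triples: list of (head, relation, tail) facts from the knowledge base
    """
    relation_of = {(head, tail): rel for head, rel, tail in kb_triples}
    examples = []
    for text, entities in sentences:
        for head in entities:
            for tail in entities:
                if head == tail:
                    continue
                # Pairs absent from the knowledge base become negative examples.
                label = relation_of.get((head, tail), no_relation)
                examples.append((text, head, tail, label))
    return examples
```

The resulting examples would then be used to train the relation classifier; because the alignment is heuristic, some labels will be wrong, which is the usual trade-off of this approach.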
6.2. Auxiliary Diagnosis and Reasoning Model for Infectious Respiratory Diseases
The auxiliary diagnosis and reasoning model of infectious respiratory diseases is based on the domain knowledge graph. After the entities and relationships of the instances are embedded, an encoding part and a decoding part are designed, and the infectious respiratory diseases are finally classified and predicted. The encoding part first constructs a convolutional layer to process the multimodal data, inserts an attention module to extract features of the instance data, and then applies the deep feature fusion model studied in this project to explore the deep information of the different modalities. The decoding part predicts the type of disease in the case to achieve the purpose of auxiliary diagnosis.
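The role of the attention module can be illustrated with a minimal scaled dot-product attention over per-modality feature vectors. This sketch covers only the attention step, not the convolutional layer or the decoder, and it is not the paper's architecture: the function names and the `query` vector, which would be a learned parameter in a trained model, are assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_fuse(modal_features, query):
    """Weight each modality's feature vector by its similarity to the query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, feat)) / scale
              for feat in modal_features]
    weights = softmax(scores)
    fused = [sum(w * feat[k] for w, feat in zip(weights, modal_features))
             for k in range(len(query))]
    return weights, fused
```

Modalities whose features align with the query receive larger weights, so the fused vector emphasizes the modalities most relevant to the prediction.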
In the era of big data, with the rapid development of multimedia technology and the growing richness of data description methods, multisource, heterogeneous, and other multimodal data are widely available [16, 17]. Multimodal data refers to data about the same object obtained from different fields or perspectives. By exploiting the complementary information between modalities, more accurate data characteristics can be learned, and subsequent data prediction and decision-making tasks can be effectively supported [18–20]. Feature learning on multimodal data requires effective data fusion methods. However, in practical applications, multimodal data usually has low-quality characteristics such as inaccuracy, incompleteness, and imbalance. Inaccuracy means that the multimodal data may contain unrelated information such as noise or irrelevant items. Incompleteness means that part of the modal information or part of the attribute information is missing for some data instances. Imbalance means that some modalities have many instances while others have relatively few, so the modalities with more instances must be used to assist the analysis and learning of the modalities with fewer instances. These characteristics pose great challenges for the design of multimodal data fusion methods.
Deep neural networks can effectively filter data noise, learn deep abstract features of the data through multilayer nonlinear transformations, and promote the fusion of similar semantics. Therefore, this project extends deep neural networks to inaccurate, incomplete, and unbalanced multimodal data and studies the corresponding deep fusion algorithms for low-quality multimodal data. Through the multilayer correlation and matching of modal data, a cross-modal deep feature fusion model that couples the modal networks with the shared features is obtained.
The data used to support the findings of this study are currently under embargo, while the research findings are commercialized.
Conflicts of Interest
The authors declare no conflicts of interest regarding the publication of this paper.
Jingyuan Zhao and Liyan Yu contributed equally as co-first authors.
This study was partially funded by the National Natural Science Foundation of China (81871725) and the Foundation of Education Department of Liaoning Province (LZ2020010).
E. Ruppé, A. Cherkaoui, V. Lazarevic, S. Emonet, and J. Schrenzel, “Establishing genotype-to-phenotype relationships in bacteria causing hospital-acquired pneumonia: a prelude to the application of clinical metagenomics,” Antibiotics, vol. 6, no. 4, pp. 30–45, 2017.
D. W. Lewandowska, P. W. Schreiber, M. M. Schuurmans et al., “Metagenomic sequencing complements routine diagnostics in identifying viral pathogens in lung transplant recipients with unknown etiology of respiratory infection,” PLoS One, vol. 12, no. 5, Article ID e0177340, 2017.
Y. Zhao, L. Bo, X. Hua et al., “Preface to the special topic on multimedia big data processing and analysis,” Journal of Software, vol. 29, no. 4, pp. 897–899, 2018.
H. Lin, Y. Wang, Y. Jia et al., “Overview of knowledge fusion methods for network big data,” Chinese Journal of Computers, vol. 40, no. 1, pp. 1–27, 2017.
Q. Zhao and Z. Li, “Cross-modal social image clustering,” Chinese Journal of Computers, vol. 41, no. 1, pp. 98–111, 2018.
L. Zhao, Z. Chen, L. T. Yang, M. J. Deen, and Z. J. Wang, “Deep semantic mapping for heterogeneous multimedia transfer learning using co-occurrence data,” ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 15, no. 1s, pp. 1–21, 2019.