Abstract

This study proposed a novel algorithm to investigate the risk factors for complex diseases. We employed the algorithm to determine the risk factors for depressive disorder, osteoporosis, and fracture in young patients with breast cancer who were receiving curative surgery. The algorithm has three steps. First, multiple correspondence analysis (MCA) is used to transform the raw data set into a multidimensional coordinate matrix. Second, the expectation-maximization (EM) algorithm is used for clustering the multidimensional coordinates for each category of variable. Third, $k$-fold cross-validation is applied to the coordinate matrix obtained using the MCA-EM algorithm to determine the optimal clustering of complex diseases and risk factors. A total of 4108 patients with breast cancer aged 20–39 years were enrolled. The results revealed that depressive disorder, osteoporosis, and fracture were clustered with liver cirrhosis, chronic obstructive pulmonary disease (COPD), distant metastasis, primary metastasis, and adjuvant therapies, namely, chemotherapy, radiotherapy, tamoxifen, aromatase inhibitors, and trastuzumab. Among the risk factors identified using this algorithm, liver cirrhosis and COPD have rarely been mentioned in the literature. In conclusion, the algorithm proposed in this study enables physicians and clinicians to identify risk factors for multiple diseases.

1. Introduction

Patients with cancer experience numerous side effects or complications as a result of chemotherapy, radiotherapy, surgery, or other treatments [1, 2]. Owing to recent advances in treatment, particularly for patients with breast cancer, survival has been significantly prolonged [3, 4]. However, breast cancer remains the most prevalent malignant neoplasm diagnosed among women worldwide and is the leading cause of cancer death among women [5–7]. In Taiwan, breast cancer is the most prevalent malignant neoplasm [8]. Aging is a major risk factor for breast cancer [9, 10], and menopause may also be considered a crucial risk factor [9, 11]. However, younger patients with breast cancer can receive (or tolerate) more treatments than older patients, which implies that younger patients are likely to experience more complications or side effects, such as fatigue, vomiting, and hair loss, which are common among women with breast cancer receiving chemotherapy or radiotherapy. In addition, some patients may develop more severe complications (or diseases), such as depressive disorder [12], osteoporosis [13], and fracture [14]. Therefore, patients with cancer may experience multiple side effects, complications, or diseases simultaneously. However, most published studies have investigated the risk factors associated with a single disease [12, 14, 15]; few have investigated the risk factors associated with multiple diseases resulting from adjuvant treatments of breast cancer. This study developed a novel algorithm to determine the risk factors for multiple diseases, namely, depressive disorder, osteoporosis, and fracture, in young patients with breast cancer receiving curative surgery. The algorithm was applied to data from the Taiwan National Health Insurance Research Database (NHIRD).

2. Materials and Methods

2.1. Study Database

The study data were retrieved from a population-based claims database, the NHIRD, in Taiwan. The National Health Insurance (NHI) program in Taiwan was launched on March 1, 1995, and it provided health care coverage to more than 99% of the Taiwanese population in 2010 [16]. The NHIRD contains population-based health care information, including outpatient and inpatient clinic or hospital visits, dental service visits, and traditional Chinese medicine services. The diagnostic and medical procedures for diseases are based on the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) and Procedure Coding System for every medical service claim.

2.2. Ethics Statement

The ethical review of the present study was approved by the Institutional Review Board of the School of Nursing, National Taipei University of Nursing and Health Sciences (CN-IRB-2011-063). In the NHIRD, the personal information of study patients is fully encrypted using double encryption procedures by the National Health Insurance Administration (NHIA). Because this study had a secondary database study design, written informed consent forms from the enrolled patients did not need to be obtained. The NHIA fully guarantees the confidentiality of the personal and health information of the study patients.

2.3. Study Population Selection Process

We first queried the NHIRD for 2001 to 2007 by using the ICD-9-CM code for breast cancer (174.XX), which yielded 73,385 patients. To restrict the cohort to newly diagnosed cases, we excluded patients whose breast cancer was diagnosed before January 1, 2002. We also excluded patients who died during 2001 to 2007; patients with a baseline diagnosis of depressive disorder (ICD-9-CM codes: 296.2X-296.3X, 300.4, and 311.X), bipolar disorder (ICD-9-CM codes: 296.0, 296.1, 296.4, 296.5, 296.6, 296.7, 296.8, 296.80, and 296.89), alcohol-use-related mental disorders (ICD-9-CM codes: V113, 9800, 2650, 2651, 3575, 4255, 3050, 291, 303, and 571.0–571.3), or osteoporosis (ICD-9-CM code: 733.XX); patients with a history of fracture (ICD-9-CM codes: 800.XX–829.XX); and male patients with breast cancer. After these exclusions, 32,776 newly diagnosed patients with breast cancer remained. Because this study aimed to investigate risk factors for complex diseases, including depressive disorder, osteoporosis, and fracture, in young patients with breast cancer receiving curative surgery, patients aged 20–39 years who had received a new diagnosis of breast cancer (ICD-9-CM code 174.XX) between January 1, 2001, and December 31, 2007, were further selected. All young patients with breast cancer who received curative surgery for the first time from 2001 to 2007 were finally recruited. New-onset depressive disorder (ICD-9-CM codes: 296.2X-296.3X, 300.4, and 311.X), osteoporosis (ICD-9-CM code: 733.XX), and fracture (ICD-9-CM codes: 800.XX–829.XX) after curative surgery were recorded as the study outcomes. At the end of the selection process, a total of 4108 young patients with breast cancer were selected for this study. The patient recruitment scheme is presented in Figure 1.
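The exclusion criteria above amount to matching each diagnosis code against ICD-9-CM patterns such as 296.2X or 733.XX and against the fracture range 800.XX–829.XX. The following is a minimal sketch in Python, not the authors' preprocessing code. It assumes that codes are stored as strings with a decimal point and that each trailing `X` stands for one optional digit; the function names and the four sample patients are hypothetical.

```python
import re

# Baseline exclusion patterns quoted from the text; "X" is a wildcard position.
EXCLUSION_PATTERNS = {
    "depressive disorder": ["296.2X", "296.3X", "300.4", "311.X"],
    "osteoporosis": ["733.XX"],
}


def pattern_to_regex(pattern):
    """Compile an ICD-9-CM pattern such as '296.2X' into a regex.

    Assumption: every trailing 'X' stands for one optional digit, so
    '733.XX' matches '733', '733.0', and '733.00'.
    """
    head, _, tail = pattern.partition(".")
    fixed = tail.rstrip("X")
    n_wild = len(tail) - len(fixed)
    body = re.escape(head)
    if fixed:
        body += r"\." + re.escape(fixed) + r"\d{0,%d}" % n_wild
    elif n_wild:
        body += r"(?:\.\d{1,%d})?" % n_wild
    return re.compile("^" + body + "$")


def matches_any(code, patterns):
    """True if the code matches at least one pattern in the list."""
    return any(pattern_to_regex(p).match(code) for p in patterns)


def is_fracture(code):
    """Fracture codes span the three-digit categories 800 to 829."""
    category = code.split(".")[0]
    return category.isdigit() and 800 <= int(category) <= 829


def has_baseline_exclusion(codes):
    """True if any baseline code is a fracture or matches an exclusion group."""
    for code in codes:
        if is_fracture(code):
            return True
        if any(matches_any(code, pats) for pats in EXCLUSION_PATTERNS.values()):
            return True
    return False


# Hypothetical records: patient id -> baseline diagnosis codes.
patients = {
    "A": ["174.9"],
    "B": ["174.4", "733.00"],
    "C": ["174.1", "805.2"],
    "D": ["174.9", "296.25"],
}
kept = [pid for pid, codes in patients.items() if not has_baseline_exclusion(codes)]
```

In this toy input only patient A has no baseline exclusion code. The bipolar and alcohol-related code lists from the text could be added to `EXCLUSION_PATTERNS` in the same way.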

2.4. Algorithm

The research database contains several categorical variables, both binomial and multinomial. First, multiple correspondence analysis (MCA) [17] is used to convert the data matrix of categorical variables into a matrix of index variables coded 0 or 1, with one index variable for each level of each categorical variable; the cross-product of this matrix is called the Burt table. Each index variable can then be transformed into Euclidean coordinates in a higher dimensional space. Second, the Euclidean coordinate matrix obtained through MCA can be treated as a high-dimensional or mixture-distribution data set; the EM algorithm has proven effective in finding hidden clusters in such data sets [18–20]. Third, $k$-fold cross-validation is used to determine the optimal number of clusters and thereby the optimal clustering. The steps of the whole algorithm are detailed as follows:

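Step 1 (MCA). Let $\mathbf{Y}$ be the raw data matrix with $n$ subjects as rows and $p$ categorical variables as columns. The notation below follows the standard correspondence analysis formulation.

(1) Convert the raw data matrix into the so-called Burt matrix:
(i) If a categorical variable is binary, it keeps its original variable type in the Burt matrix.
(ii) If a categorical variable has more than two levels (i.e., $q$ levels), convert it into a so-called indicator matrix, $\mathbf{X}_q$, in which each column is a binary variable coded 0 or 1.
(iii) Place all the binary variable columns together to form the indicator matrix $\mathbf{X}$.
(iv) Generate the Burt matrix as $\mathbf{B} = \mathbf{X}'\mathbf{X}$.

(2) Calculate the column and row coordinates:
(i) The grand total of $\mathbf{B}$ is $N = \mathbf{1}'\mathbf{B}\mathbf{1}$; calculate the probability matrix as $\mathbf{P} = \mathbf{B}/N$.
(ii) Let $\mathbf{r}$ denote the vector of the row totals of $\mathbf{P}$ (i.e., $\mathbf{r} = \mathbf{P}\mathbf{1}$, where $\mathbf{1}$ is a conformable vector comprised of 1's), and let $\mathbf{c} = \mathbf{P}'\mathbf{1}$ denote the vector of the column totals of $\mathbf{P}$; $\mathbf{D}_r = \operatorname{diag}(\mathbf{r})$, and $\mathbf{D}_c = \operatorname{diag}(\mathbf{c})$.
(iii) Calculate the coordinate scores by using the singular value decomposition method:

$$\mathbf{D}_r^{-1/2}\left(\mathbf{P} - \mathbf{r}\mathbf{c}'\right)\mathbf{D}_c^{-1/2} = \mathbf{U}\mathbf{D}_u\mathbf{V}',$$

where $\mathbf{D}_u$ is the diagonal matrix of singular values, and $\mathbf{D}_u^2$ is the matrix containing eigenvalues. The so-called row coordinates $\mathbf{F}$ and column coordinates $\mathbf{G}$ are thus obtained as follows:

$$\mathbf{F} = \mathbf{D}_r^{-1/2}\mathbf{U}\mathbf{D}_u, \qquad \mathbf{G} = \mathbf{D}_c^{-1/2}\mathbf{V}\mathbf{D}_u.$$

(3) Determine the number of dimensions by using inertia estimation:
(i) The Pearson chi-squared ($\chi^2$) based distances from rows and columns to their respective point coordinate centers are calculated as

$$d_i^2 = \sum_{j}\frac{\left(p_{ij}/r_i - c_j\right)^2}{c_j}, \qquad d_j^2 = \sum_{i}\frac{\left(p_{ij}/c_j - r_i\right)^2}{r_i}.$$

(ii) If we select a subset $S$ of the dimensions of $\mathbf{F}$ or $\mathbf{G}$, the inertia for the row coordinates and column coordinates for each level is given by

$$\mathrm{In}(i) = r_i\sum_{s \in S} f_{is}^2, \qquad \mathrm{In}(j) = c_j\sum_{s \in S} g_{js}^2,$$

where $f_{is}$ and $g_{js}$ are the elements of $\mathbf{F}_S$ and $\mathbf{G}_S$, the subsets of $\mathbf{F}$ and $\mathbf{G}$.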
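Step 1 can be sketched numerically as follows. This is an illustrative NumPy sketch under stated assumptions, not the STATISTICA routine used in the study: it assumes the standard correspondence analysis of the Burt matrix, gives every level its own indicator column (including both levels of a binary variable), and runs on a hypothetical four-subject toy table. The function name `mca_burt` and the parameter `n_dims` are ours.

```python
import numpy as np


def mca_burt(data, n_dims=2):
    """Correspondence analysis of the Burt matrix built from `data`.

    `data` is a list of rows; each row holds one category label per variable.
    Returns the level keys, row coordinates F, column coordinates G (both
    truncated to `n_dims` columns), and the principal inertias.
    """
    # One indicator column per (variable index, level) pair.
    levels = sorted({(j, v) for row in data for j, v in enumerate(row)})
    col = {key: i for i, key in enumerate(levels)}
    X = np.zeros((len(data), len(levels)))
    for i, row in enumerate(data):
        for j, v in enumerate(row):
            X[i, col[(j, v)]] = 1.0

    B = X.T @ X                # Burt matrix
    P = B / B.sum()            # probability (correspondence) matrix
    r = P.sum(axis=1)          # row masses
    c = P.sum(axis=0)          # column masses

    # Standardized residuals, then singular value decomposition.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(S, full_matrices=False)

    F = (U * s) / np.sqrt(r)[:, None]       # row principal coordinates
    G = (Vt.T * s) / np.sqrt(c)[:, None]    # column principal coordinates
    inertia = s ** 2                        # eigenvalues
    return levels, F[:, :n_dims], G[:, :n_dims], inertia


# Hypothetical toy data: 4 subjects, 2 categorical variables.
toy = [["a1", "b1"], ["a1", "b2"], ["a2", "b2"], ["a2", "b3"]]
levels, F, G, inertia = mca_burt(toy, n_dims=2)
```

Because the Burt matrix is symmetric, the row and column coordinates describe the same set of levels, so either `F` or `G` can be passed on to the clustering step.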

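Step 2. Expectation-maximization (EM) algorithm for clustering. The equations below are written in the standard normal-mixture notation.

(1) For a high-dimensional data set or mixture distribution data set, assuming that each variable $x$ in the row coordinate matrix $\mathbf{F}$ or the column coordinate matrix $\mathbf{G}$ is a random variable, the matrix can be modeled with a normal-distribution probability density function as

$$\phi(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),$$

where $\mu$ and $\sigma$ are the mean and standard deviation for each variable in $\mathbf{F}$ or $\mathbf{G}$.

(2) We use the coordinates for each level of each categorical variable obtained using the MCA method, assuming that the number of dimensions selected using inertia, which is previously obtained through MCA, is $d$.
(i) Combine $M$ density functions to model a mixture distribution:

$$f(x \mid \theta) = \sum_{m=1}^{M}\pi_m\,\phi(x \mid \mu_m, \sigma_m), \qquad \sum_{m=1}^{M}\pi_m = 1.$$

The $m$th normal distribution is parametrized using $\theta_m = (\pi_m, \mu_m, \sigma_m)$.
(ii) Given the coordinates $x_1, \ldots, x_J$ of the $J$ category levels, the log-likelihood function of $\theta$ is as follows:

$$\ell(\theta) = \sum_{i=1}^{J}\log\sum_{m=1}^{M}\pi_m\,\phi(x_i \mid \mu_m, \sigma_m).$$

(3) For the EM formulation, find $\hat{\theta}$, if it exists, that maximizes the following function:

$$Q\left(\theta \mid \theta^{(t)}\right) = \sum_{i=1}^{J}\sum_{m=1}^{M}\gamma_{im}^{(t)}\left[\log\pi_m + \log\phi(x_i \mid \mu_m, \sigma_m)\right].$$

(i) E-step (expectation step): compute the likelihood that $x_i$ belongs to component $m$:

$$\gamma_{im}^{(t)} = \frac{\pi_m^{(t)}\,\phi\left(x_i \mid \mu_m^{(t)}, \sigma_m^{(t)}\right)}{\sum_{l=1}^{M}\pi_l^{(t)}\,\phi\left(x_i \mid \mu_l^{(t)}, \sigma_l^{(t)}\right)}.$$

(ii) M-step (maximization step): update $\pi_m$, $\mu_m$, and $\sigma_m$ for each categorical variable; for example:

$$\pi_m^{(t+1)} = \frac{1}{J}\sum_{i=1}^{J}\gamma_{im}^{(t)}, \qquad \mu_m^{(t+1)} = \frac{\sum_{i=1}^{J}\gamma_{im}^{(t)}x_i}{\sum_{i=1}^{J}\gamma_{im}^{(t)}}, \qquad \left(\sigma_m^{(t+1)}\right)^2 = \frac{\sum_{i=1}^{J}\gamma_{im}^{(t)}\left(x_i - \mu_m^{(t+1)}\right)^2}{\sum_{i=1}^{J}\gamma_{im}^{(t)}}.$$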
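The E-step and M-step of Step 2 can be illustrated with a short NumPy sketch. It is a simplified illustration, not the study's implementation: it clusters a single coordinate dimension (the study uses the $d$ retained MCA dimensions), initializes the means at evenly spaced quantiles, and adds small numerical floors to avoid division by zero. The function name `em_normal_mixture` and the ten sample coordinates are hypothetical.

```python
import numpy as np


def em_normal_mixture(x, n_components, n_iter=200, tol=1e-8):
    """EM for a one-dimensional normal mixture.

    Returns mixing weights, means, standard deviations, hard cluster labels,
    and the log-likelihood evaluated at the last E-step.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    # Assumed initialization: equal weights, means at evenly spaced quantiles,
    # and a common standard deviation equal to the pooled one.
    pi = np.full(n_components, 1.0 / n_components)
    mu = np.quantile(x, (np.arange(n_components) + 0.5) / n_components)
    sigma = np.full(n_components, x.std() + 1e-6)

    prev = -np.inf
    for _ in range(n_iter):
        # E-step: posterior membership probabilities (responsibilities).
        z = (x[:, None] - mu) / sigma
        dens = pi * np.exp(-0.5 * z ** 2) / (np.sqrt(2.0 * np.pi) * sigma)
        total = np.maximum(dens.sum(axis=1, keepdims=True), 1e-300)
        gamma = dens / total
        loglik = np.log(total).sum()

        # M-step: weighted updates of weights, means, and standard deviations.
        nk = gamma.sum(axis=0) + 1e-12
        pi = nk / n
        mu = (gamma * x[:, None]).sum(axis=0) / nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        sigma = np.sqrt(var) + 1e-6

        if loglik - prev < tol:   # stop when the log-likelihood stalls
            break
        prev = loglik
    return pi, mu, sigma, gamma.argmax(axis=1), loglik


# Hypothetical coordinates forming two well-separated groups.
coords = [-2.1, -1.9, -2.0, -2.2, -1.8, 2.0, 2.1, 1.9, 2.2, 1.8]
pi, mu, sigma, labels, loglik = em_normal_mixture(coords, n_components=2)
```

On this toy input the two fitted means settle near −2 and 2, and the labels separate the first five points from the last five.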

Step 3. Apply $k$-fold cross-validation to the results of Step 2 to determine the optimal number of clusters.
(i) Randomly divide the overall sample into $k$ folds (a single, prespecified value of $k$ was used in this study).
(ii) All the levels of categorical variables with coordinates are classified into $k-1$ folds as the training sample and 1 fold as the testing sample, and this split is repeated for each of the $k$ folds.
(iii) The results for the $k$ replications are aggregated using $-2\log(\text{likelihood})$ as the measurement of clustering cost (the lower, the more favorable) and are shown in a scree plot for determining the optimal number of clusters.
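The cross-validation cost in Step 3 can be sketched as follows. This is a hedged illustration rather than the STATISTICA procedure: it uses a compact one-dimensional normal-mixture EM fit, an assumed value of $k = 5$ (the study's value is not stated here), and synthetic data drawn from two normal distributions. All function names are ours.

```python
import numpy as np


def mixture_density(x, pi, mu, sigma):
    """Weighted component densities, shape (n_points, n_components)."""
    z = (x[:, None] - mu) / sigma
    return pi * np.exp(-0.5 * z ** 2) / (np.sqrt(2.0 * np.pi) * sigma)


def fit_mixture(x, m, n_iter=100):
    """Compact EM fit of an m-component one-dimensional normal mixture."""
    pi = np.full(m, 1.0 / m)
    mu = np.quantile(x, (np.arange(m) + 0.5) / m)
    sigma = np.full(m, x.std() + 1e-6)
    for _ in range(n_iter):
        dens = mixture_density(x, pi, mu, sigma)
        gamma = dens / np.maximum(dens.sum(axis=1, keepdims=True), 1e-300)
        nk = gamma.sum(axis=0) + 1e-12
        pi = nk / nk.sum()
        mu = (gamma * x[:, None]).sum(axis=0) / nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        sigma = np.sqrt(var) + 1e-3   # floor keeps components from collapsing
    return pi, mu, sigma


def neg2_loglik(x, pi, mu, sigma):
    """Clustering cost: -2 * log-likelihood of x under the fitted mixture."""
    total = mixture_density(x, pi, mu, sigma).sum(axis=1)
    return -2.0 * np.log(np.maximum(total, 1e-300)).sum()


def cv_cost(x, m, k=5, seed=0):
    """Sum of held-out -2 log-likelihood over k folds for m clusters."""
    x = np.asarray(x, dtype=float)
    order = np.random.default_rng(seed).permutation(x.size)
    folds = np.array_split(order, k)
    cost = 0.0
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        params = fit_mixture(x[train], m)          # fit on k-1 folds
        cost += neg2_loglik(x[folds[i]], *params)  # score the held-out fold
    return cost


# Synthetic data: two groups of 60 points each (not study data).
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3.0, 0.5, 60), rng.normal(3.0, 0.5, 60)])
costs = {m: cv_cost(x, m) for m in (1, 2, 3, 4)}
```

Plotting `costs` against the number of clusters gives the scree plot described in item (iii); the number of clusters is then read off where the cost is lowest or stops improving.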

MySQL was used for database preprocessing, which comprised extraction, linkage, and cleaning of NHIRD data in this study. After the database was processed according to the inclusion and exclusion criteria, the study data sets were obtained. All statistical analyses were performed using STATISTICA (version 10 for Windows; Statistica, Tulsa, OK, USA), and a two-tailed $p$ value below the prespecified significance level was considered statistically significant.

3. Results

According to the recruitment scheme outlining patient selection (Figure 1), 4108 patients aged between 20 and 39 years who had received a diagnosis of breast cancer between January 1, 2001, and December 31, 2007, were recruited. The mean age of the patients was 34.6 years (standard deviation [SD] = 3.7 years). Of all the patients, 3.1% had depressive disorder, 0.7% had a fracture event, and 1.7% had osteoporosis. Regarding comorbidities, 1.7% of patients had diabetes mellitus, 1.8% had hypertension, 0.7% had a history of heart failure, 0.5% had coronary heart disease, 0.4% had cerebrovascular disease, 2.3% had autoimmune disease, 0.9% had kidney disease, 0.4% had renal disease, 0.3% had liver cirrhosis, and 5.3% had chronic obstructive pulmonary disease (COPD). In addition, 13.4% exhibited distant metastasis and 6.9% exhibited primary metastasis. Regarding adjuvant therapies, 52.7% were receiving chemotherapy, 23.9% were receiving radiotherapy, 63.9% were receiving tamoxifen-related treatments, 7.5% were receiving aromatase inhibitor- (AI-) related treatments, and 3.2% were receiving trastuzumab treatments (Table 1).

After $k$-fold cross-validation was incorporated into the EM clustering of the MCA coordinate matrix, the four-cluster solution had the smallest −2log-likelihood value; therefore, a four-cluster classification was defined as the optimal clustering approach (Figure 2).

Plots of each level of all variables in this study are presented in Figures 3(a) and 3(b). Figure 3(a) presents the four clusters obtained for all the points (levels of categorical variables in the study) in four colors without labels in a three-dimensional scatter plot, whereas Figure 3(b) labels each point. The three outcome variables (depressive disorder, fracture, and osteoporosis) were clustered with chemotherapy, radiotherapy, tamoxifen, AIs, trastuzumab, primary metastasis, distant metastasis, and comorbidities including liver cirrhosis and COPD (Figure 3(b)). The clustering results are also tabulated in the right column of Table 2, together with a comparison with two published studies.

4. Discussion

The present study proposed a novel algorithm and used the NHIRD, a population-based database, to investigate risk factors for multiple diseases, namely, depressive disorder, osteoporosis, and fracture, among young patients with breast cancer who were receiving curative surgery. The study results revealed that depressive disorder, osteoporosis, and fracture were clustered with chemotherapy, radiotherapy, tamoxifen, AIs, and trastuzumab treatments; this finding agrees with those of other studies [12, 14]. In addition, depressive disorder, osteoporosis, and fracture were clustered with distant metastasis and primary metastasis. Depressive disorder was associated with distant and primary metastases; these associations were proposed in published studies [21, 22]. However, depressive disorder, osteoporosis, and fracture were also clustered with liver cirrhosis and COPD, which have rarely been investigated. Liver cirrhosis has been linked to increased levels of estrogen, which may be a risk factor for breast cancer [23]. In addition, a study from 1992 demonstrated that breast cancer may be associated with the antigen CA 15-3 in liver cirrhosis [24]; however, more recently published articles have rarely mentioned this association. The association of COPD with depressive disorder, osteoporosis, and fracture among young patients with breast cancer has not been studied in recent years; therefore, further investigation of this association may be required.

The present study adopted a different approach to identifying risk factors for multiple diseases: we used “clustering” instead of typical statistical analysis methods (e.g., logistic regression and Cox regression). Most studies have addressed univariate outcome variables (e.g., overall survival for death outcome, disease- (or progression-) free survival for recurrence of some disease, and binary outcome variable with/without some disease). Studies that have employed univariate statistical analysis methods can provide information on risk factors for the onset of a specific disease. However, patients with cancer usually experience multiple and complex symptoms or disease onsets, and studies or analysis methods addressing multiple concurrent diseases are still very limited. In this study, we proposed a novel algorithm that can analyze multiple outcome variables, and the findings provide clinical implications for clinicians that are not solely based on the results of univariate analyses.

The present study had some limitations. First, the NHIRD is a medical claims database, which does not include health behavior variables such as smoking behavior, alcohol consumption, lifestyle, and exercise, which may be associated with the risk of depressive disorder, osteoporosis, and fracture. Second, the NHIRD does not provide information on cancer staging, genetic mutations, or some environmental factors, which may be potential confounders associated with the risk of depressive disorder, osteoporosis, and fracture. Third, the patients enrolled in this study were mainly of Chinese ethnicity; thus, the study results derived from the proposed novel algorithm may not be generalizable to other ethnic populations.

In conclusion, the present study proposed a novel algorithm that can manage or cluster multiple disease outcomes with potential risk factors by using a large-scale population-based database. This novel algorithm can be straightforwardly applied to other diseases to help clinicians identify more potential risk factors if they plan to consider the potential risk factors associated with multiple disease outcomes.

Abbreviations

MCA: Multiple correspondence analysis
EM: Expectation-maximization
SD: Standard deviation
NHIRD: National Health Insurance Research Database
NHIA: National Health Insurance Administration
COPD: Chronic obstructive pulmonary disease
AI: Aromatase inhibitor
MOHW: Ministry of Health and Welfare.

Data Availability

Due to the Personal Information Protection Act of Taiwan, the National Health Insurance Research Database (NHIRD) has been prohibited from releasing insurees’ medical claim data via applications since June 28, 2016 (https://nhird.nhri.org.tw/apply_00.html). The Ministry of Health and Welfare (MOHW) of Taiwan decided that all NHIRD data analyses must be processed in the Health Data Science Center, which was established by the MOHW. Researchers who are interested in NHIRD research can seek access to the NHIRD by formal application (application website: https://dep.mohw.gov.tw/DOS/np-2497-113.html). All the authors of this paper understand and appreciate the need for data transparency in research and are ready to make the data available to those who have a research interest in accessing NHIRD data. Please direct your queries to NHIRD administrators Tze-Hui Wu or Zong-Ying Lin.

Conflicts of Interest

The authors declare no conflicts of interest.

Authors’ Contributions

Chieh-Yu Liu was the principal investigator in charge of this study, conceptualizing the study design, applying the study database, coding the algorithm, conducting data analysis, drafting the initial manuscript, and critically reviewing and approving the final, submitted version of the manuscript. Chun-Hung Chang was the research assistant of this study, providing technological support for database maintenance and administrative support. All the authors read and approved the final manuscript.

Acknowledgments

The authors thank the Health Data Science Center of the Ministry of Health and Welfare of Taiwan for providing access to the National Health Insurance Research Database for data analysis in this study. This manuscript was edited by Wallace Academic Editing.