The transition of cancer from a localized tumor to a distant metastasis is not well understood for prostate and many other cancers, partly, because of the scarcity of tumor samples, especially metastases, from cancer patients with long-term clinical follow-up. To overcome this limitation, we developed a semi-supervised clustering method using the tumor genomic DNA copy number alterations to classify each patient into inferred clinical outcome groups of metastatic potential. Our data set was comprised of 294 primary tumors and 49 metastases from 5 independent cohorts of prostate cancer patients. The alterations were modeled based on Darwin’s evolutionary selection theory and the genes overlapping these altered genomic regions were used to develop a metastatic potential score for a prostate cancer primary tumor. The function of the proteins encoded by some of the predictor genes promote escape from anoikis, a pathway of apoptosis, deregulated in metastases. We evaluated the metastatic potential score with other clinical predictors available at diagnosis using a Cox proportional hazards model and show our proposed score was the only significant predictor of metastasis free survival. The metastasis gene signature and associated score could be applied directly to copy number alteration profiles from patient biopsies positive for prostate cancer.

1. Introduction

Prostate cancer is a common public health problem. In 2012, this disease was expected to be diagnosed in an estimated 241,740 men (29% of all male cancers) and to result in 28,170 deaths (9% of male cancer deaths) [1]. If left untreated, around 70% of prostate cancers remain asymptomatic and indolent for decades [2]. If treated with radical prostatectomy or radiation therapy, the risk of metastasis is reduced, but erectile dysfunction, urinary incontinence, and rectal bleeding may occur, affecting the patient’s quality of life. Because it is currently difficult to determine accurately which patients will develop metastatic disease, physicians treat patients with mid-to-late stage local disease aggressively, even when such treatment may not be required. Clinical parameters, such as, serum concentration of prostate-specific antigen (PSA), extension beyond surgical margins, invasion of seminal vesicles, extension beyond the capsule, surgical Gleason score, prostate weight, race, and year of surgery, are employed in existing nomograms for prediction of local recurrences after surgery [3], but, many of these parameters are not available at diagnosis and cannot be used for guiding therapeutic decisions. Development of a robust risk model from a biopsy that accurately predicts the potential of a local prostate cancer to metastasize would justify aggressive treatment in high-risk cases and improve the quality of life for men with indolent disease by allowing them to avoid treatment-related side effects. Thus, the goal of this study was to develop a method to identify tumor genomic biomarkers that could be applied to prediction models that help guide clinical treatment decisions.

The method chosen for developing the predictive model was the analysis of genomic DNA copy number alterations (CNAs) in prostate cancers, because these cancers have long been known to harbor multiple genomic imbalances that result from CNAs [4, 5]. High-resolution measurements of CNAs have functional value, in some cases providing evidence for alterations in the quantity of normal, mutant, or hybrid-fusion transcripts and proteins in the cancer cells. The resulting changes in abundance or altered structure of RNA transcripts and proteins (e.g., truncating dominant negative mutations) may impact the fitness of the cell and provide some of the mechanisms necessary for distant site migration, invasion, and growth. From the multiple CNAs identified in tumors, CNA-based gene signatures were developed into a score that suggested the ability to predict metastasis free survival.

2. Methods

2.1. Cohorts and Samples

We studied four publically available prostate cancer cohorts and a fifth cohort reported here: (1) 294 primary tumors and matched normal tissue samples from NYU School of Medicine (NYU 𝑛=29), Baylor College of Medicine (Baylor 𝑛=20) [6], Memorial Sloan-Kettering Cancer Center (MSK 𝑛=181) [7], and Stanford University (SU 𝑛=64 (single reference used for each tumor)) [8]. (2) 49 metastatic tumors and matched normal samples from Johns Hopkins School of Medicine (Hopkins 𝑛=13) [9] and MSK (𝑛=36) [7]. The 13 patients in the Hopkins cohort had multiple metastases dissected at autopsy, totaling 55 samples for the study. We also studied a sixth, publically available cohort of 337 cell lines originating from varying tumor cell types (ArrayExpress ID: E-MTAB-38).

2.2. Sample Processing

Genomic DNA (gDNA) from the NYU cohort was extracted from fresh-frozen prostate tumors using a Gentra DNA extraction kit (Qiagen). Purified gDNA was hydrated in reduced TE buffer (10 mM Tris, 0.1 mM EDTA, pH 8.0). The gDNA concentration was measured using the NanoDrop 2000 spectrophotometer at optical density (OD) wavelength of 260 nm. Protein and organic contaminations were measured at OD 280 nm and 230 nm, respectively. Samples that passed OD quality control thresholds were then run on a 1% agarose gel to assess the integrity of the gDNA. 500 ng of gDNA samples was run on the Affymetrix Human SNP Array 6.0 at the Rockefeller University Genomics Resource Center using standard operating procedures. Samples that were obtained from public sources were processed according to the methods outlined in their respective publications. Affymetrix  .cel files were processed using the Birdseed v2 algorithm [10].

2.3. Study Design

The case samples in this study were either metastatic tumors (METS) or primary tumors from men treated with radical prostatectomy that were clinically followed up and reported to develop distant metastases (mPTs). METS and mPTs are clearly discernible phenotypes that can be classified unequivocally as cases. The control samples were defined as primary tumors that had not progressed to form distant metastases following radical prostatectomy either because clinical followup was not available or because the treatment rendered the patient not informative for this outcome. Radical prostatectomy treats both indolent primary tumors (iPTs) that would not metastasize and primary tumors that would otherwise progress to form metastases, if left untreated. Thus, the control primary tumors actually represent a mixture of iPTs and unrealized mPTs. Assuming a randomly sampled cohort, it is expected that about 30% of the control group of primary tumors would be unrealized mPTs [2]. Considering the scarcity of clinically informative mPTs and iPTs for study, our strategy for identifying CNA biomarkers from tumors with inferred metastatic outcomes allowed a greater number of individual genomes to be used. Accordingly, all of the clinically informative mPTs available to us were not used to identify the biomarkers and only tested in a Cox proportional hazard model to assess the clinical usefulness of these predictors. Future tumor cohort study design using the method presented in this paper should consider the prevalence of metastatic progression to assure a large enough representation of both mPTs and iPTs. The natural history of prostate cancer, without medical intervention, (e.g., watchful waiting or active surveillance) is well documented [2]. Assuming a randomly sampled cohort, this information allowed us to estimate the prevalence of mPTs to be 30%.

2.4. Cancer Genomics Copy Number Algorithm

A genomic DNA copy number analysis pipeline (Figure 1) was designed using the R-statistical software [11] (R) to process the raw intensity data through a series of computational steps resulting in ranked lists of genes and associated significance that could be used for functional mining and prediction model development. The R-package will be provided upon request and raw and processed data can be obtained from Gene Expression Omnibus accession# GSE27105.

2.5. Raw Data Processing

Signal intensity files (.cel) for the Affymetrix SNP Array 6.0 or 500 k mapping arrays were processed using the Affymetrix Power Tools, Birdseed V2 [10], and BRLMM [12] algorithms, respectively, resulting in genotype allele calls and signal intensity measures for each SNP and copy number probe. After the first stage, the genotype calls were prepared for downstream principal component analysis for ethnic identification and quality control testing, especially important when investigating racially driven health disparities (Figure 2). Men of African descent have an increased incidence, earlier onset, and more aggressive form of the disease than those of European origin. Even when adjusted for the increased level of incidence in African Americans, the mortality rate of African American men is more than twice that of Caucasian men [1]. Although not presented in the current work, sophisticated CNA models of metastatic disease may provide a biological explanation for the epidemiological observations of racial health disparity of metastasis.

The probe-summarized intensity signals were log transformed and standardized (mean centered, standard deviation scaled) on an individual array basis and the relative copy number was calculated by subtracting the normal from the tumor intensity for each patient on a probe basis. The resulting copy number profile (CN) represented the amplification and deletion events that accumulated in each cancer sample tested.

Next, the probes were ordered as they appear in the genome and the copy number signal data (CN) was smoothed. The smoothing was conducted using a running median function (runmed in 𝑅, with endrule parameter equal to β€œmedian”). The smoothing function was termed 𝑆(CN)π‘˜, where π‘˜ represents the probe width of the smoothing window. The values of π‘˜ usually range from 5 to 151, depending on the array’s probe density and were chosen not to exceed a biologically meaningful span of total genetic distance. Considerations for π‘˜ should include the average alteration size (estimated empirically from each data set) and distance between probes as determined by the array probe density. As an extreme example, smoothing the entire arm of a chromosome will remove all local variation that exists on that arm. The function 𝑆(CN)π‘˜ thus yielded 𝑛 smoothing profiles per sample, with 𝑛 representing the number of different values used for π‘˜. An example of the multiple 𝑛 values used for chromosome 1 of a particular sample is shown in Figure 3.

2.6. Copy Number Alteration Calling Algorithm

The next part of this stage involved assigning copy number events to each probe. The reason we developed a CNA caller from scratch was because the standard calling algorithms required parameter inputs that were dependent on the signal-to-noise distribution of the copy number measures. Because cancer samples’ signal-to-noise are notoriously variable, both on a chromosome basis (within a sample profile) and across samples, this made the standard CNA calling approaches inefficient without significant reconfiguration. Therefore, we developed a method that was dynamic to the signal-to-noise variation observed in cancer genomes. We validate the effectiveness our approach (Figure 4) using a benchmark simulation data set used to test a variety of algorithms [13]. Given that SNP arrays are not designed to provide quantitative measures of copy number (but do respond linearly to CNAs), we restrict our calls to three categories: amplifications (1), deletions (βˆ’1), and neutral events (0). To determine the β€œcenter” of the genome so that thresholds can be drawn, we assume that a majority of the intensity values reflect a 2-copy state for the referenced sample, that is, the majority of the referenced tumor sample exists in a 2-copy state (manual calling is used for those samples in which this assumption is not valid). To accomplish this, we sample 10,000 random stretches of probes covering approximately 500 kilobases from the autosomes, calculate the median of each, and use the most frequently occurring value to scale the sample appropriately. Following scaling of the genome, thresholds were drawn based on quantile values and copy number states were assigned to each probe. Since this thresholding scheme was applied to every smoothing, there were 𝑛 event calls per probe. These calls result in a β€œπœŒβ€ profile, where 𝑇() represents the function of trinary binning: πœŒπ‘˜ξ€·π‘†=𝑇(CN)π‘˜ξ€Έ.(2.1) The π‘›πœŒ calls for each probe were then combined by summation, resulting in a composite profile (πœŒξ…ž) that ranged from βˆ’π‘› (signifying that a deletion was called at every smoothing for that probe) to +𝑛 (signifying that an amplification was called at every smoothing for that probe): πœŒξ…ž=𝑛𝑖=1πœŒπ‘–.(2.2)

One πœŒξ…ž profile was thus generated per sample, representing a composite of 𝑛 smoothings, and this metric was used for the rest of the primary analysis. We benchmarked our copy number calling method using a published simulation data set [13] comprised of randomly generated artificial chromosomes. Each chromosome was generated with an aberration flanking the center probe with Gaussian noise 𝑁(0,0.252) superimposed. All combinations of signal to noise (SN=4,3,2,and1) and aberration widths (π‘Š=40,20,10,and5) were produced for a total of 160,000 analysis runs. Receiver-operating characteristics (ROC) were computed from the benchmark simulation dataset [13]; where ROC is defined as a pair, ROC = (TPR, FPR), TPR = (the number of probes within the aberration width that is above a threshold)/(the total number of probes within the aberration width). FPR = (the number of probes outside the aberration width that is above a threshold)/(the total number of probes outside the aberration width). The threshold values are selected to continuously range over the values of the data points, and since ROC is piece-wise constant, only changing when a threshold is equal to the value of a data point, we only need to consider values of the data points in their sorted order. The area under the curve (AUC) of each ROC curve was used to gauge performances.

To examine the frequency of amplification and deletions for subgroups of samples or populations and evaluate the sensitivity of our CNA-calling method, we further combined the πœŒξ…ž data to create πœŒξ…žξ…ž by summing across the πœŒξ…ž profiles on a probe basis across multiple samples. Two values of πœŒξ…žξ…ž were calculated for population or subpopulation. The first value represented the sum of all positive πœŒξ…ž values in the population at any probe and was thus called πœŒξ…žξ…žamp. Likewise, the second value representing the sum of all negative πœŒξ…ž values in the population at any probe was called πœŒξ…žξ…ždel: πœŒξ…žξ…žamp∣del=𝑛samples𝑖=1πœŒξ…žξ€Ίamp∣delξ€».(2.3) An example of copy number πœŒξ…žξ…ž plot (Figure 5) is observed in a select region on chromosome X from metastases of men treated with androgen ablation therapy and primary tumors of iPTs and mPTs from other men not treated. Furthermore, differential analysis of the πœŒξ…žξ…ž values can be used to identify probes or regions of probes that comprise genes that may contribute to the phenotype being tested (e.g., iPT versus mPT or response to therapy versus no response to therapy).

2.7. Semisupervised Clustering Algorithm

Since sufficient labels were not available to train a model from primary tumors alone, we first created from a cohort of men that developed distant metastases a simplified summary metastasis profile to capture the high-frequency events, that are in part, assumed to correlate to the outcome. This clustering approach is not unsupervised, class-less clustering because we know some information about one of the components which is the summary profile from known metastasis samples. To reflect the frequency of events observed for individual metastasis CNA profiles in the summary metastasis profile, the average number of πœŒξ…ž events calculated for the group of metastases was used to set a threshold for the number of total πœŒξ…ž events used to build the summary metastasis profile. The actual probes chosen for the metastasis summary profile were based on their ranked frequency which resulted in a threshold of at least 25% of the samples exhibiting the event. Although not tested here, the theoretical specificity of the summary profile is expected to decrease as the threshold for minimum number of events called decreases, while the sensitivity of the profile decreases as the threshold of minimum number of events called increases. In the case of the MSK cohort, clustering of the 36 metastases πœŒξ…ž profiles independently yielded two well-separated clusters from which we built two metastasis summary profiles to perform semisupervised clustering with the primary tumors. Alternatively, the 13-patient Hopkins cohort made up of 55 metastases yielded only one homogeneous cluster and associated summary metastasis profile. To overcome the inherent variability with clustering algorithms, we employed a resampling hierarchical clustering method to infer an initial grouping for the unclassified primary tumors. For each iteration, a subset of the individual πœŒξ…ž profiles from the unknown primary tumors were randomly chosen with replacement and clustered with the summary copy number profile derived from the metastasis samples (one metastasis summary profile from the Hopkins cohort and two from MSK cohort). Therefore, the semisupervised clustering analysis presented here was developed to classify prostate primary tumors into subgroups with different metastatic potential (mPT and iPT) based on their CNA profiles. Distance was calculated using a binary metric, and the samples were joined using hierarchical clustering (complete-linkage method). The cluster tree was divided into two groups at the final join, and the primary tumor samples were scored 1 if they fell in the same cluster as the metastasis profile, and 0 if they were in the other cluster. Using the results from 20,000 resampling iterations of the clustering, a proximity score was generated for each sample, representing the number of times it fell in a cluster with the metastasis profile. A sample with a high score was considered to be more metastatic (mPT), while lower scoring tumors were more indolent (iPT). The similarity scores distributed throughout the possible range of values (0 to 1), allowing us to form distinct groups of tumors with significant contrast between high- and low-metastatic distance to MSK metastasis signature 1 (Figure 6). The group of samples with scores closer to the center of the distribution were omitted to further define the contrast between high- and low-scoring samples.

2.8. Metastasis Genes Inferred through Evolutionary Selection Modelling

Genomic DNA copy number alterations in local and metastatic prostate tumors are typically numerous, systematic in their genomic placement and varied in size from point mutations to duplications or deletions of entire chromosomes. Given these observations, geneticists have postulated that Darwinian selection may operate on the genomic instability in tumors [14]. High-resolution measurements of CNAs in somatic tumors have informative value, in some cases reflecting the direction in which the biochemistry of the cell controls the quantity of normal, mutant, or hybrid-fusion transcripts and proteins. During this genomic transformation, the resulting modified transcripts and proteins may impact the fitness of the cell. Guided by these principles of evolutionary selection, our analyses sought to identify the CNA landscape that reflects selection mechanisms of metastasis. Genomic selection towards a metastatic cancer phenotype can be both positive and negative and be observed in CNAs exhibiting both amplifications and deletions. For example, genes that promote metastasis and amplified in metastatic tumors would reflect positive selection, while metastasis suppressor genes that are deleted in metastases reflect negative selection. The genes associated with these regions, altered at high frequency in metastatic tumors and enriched in mPTs more so than iPTs, lead to enhanced metastatic potential. We identified specific CNAs that selected positively for metastatic potential, exhibiting amplifications in metastases and mPTs and deletions in iPTs. CNAs identified to exhibit negative selection for metastatic potential were observed to be deleted in metastases and mPTs and amplified in the iPTs. Therefore, we designed models based on Darwin’s evolutionary selection theory to score positive and negative selection based on the mPT and iPT classifications derived through semisupervised clustering using the πœŒξ…ž data. For each probe on the array, we calculated an enrichment score, EN(π‘₯), which represented the relative number of amplifications versus deletions, observed in each subgroup (metastasis, mPT and iPT): EN(π‘₯)=(#Ampβˆ’#Del).#Samples(2.4)

Next, we modeled the relative enrichment by contrasting the metastasis and mPT copy number alterations with those observed in the iPT group: SM=𝑒[EN(METS)+π‘žβˆ—EN(mPT)βˆ’EN(iPT)].(2.5)

The first two enrichment terms (for metastatic and metastatic-like samples) being summed were designed to assign a higher score when the METS and mPT samples had more amplifications than deletions. Greater amplification enrichment in the METS and mPTs resulted in higher scores. The third term, EN(iPT), was higher when the iPT samples exhibit the opposite effect (enrichment for deletions over amplifications). The middle term, EN(mPT), was multiplied by a data-driven coefficient, π‘ž, representing the average contribution of mPT on a probe basis (Figure 7).

For example, probes that were amplified in all metastases and mPTs but deleted in all iPTs (positive selection driving the metastasis cells) would yield the highest possible score. Likewise, probes that were deleted in all metastases and mPT samples, but amplified in all iPT samples (negatively select or inhibit the promotion of the metastasis cells), would reach the minimum possible score. Therefore, regions of the genome that enhance and inhibit metastasis formation will be captured by our evolutionary selection model.

Following this probe scoring method we developed a 𝑍-score model in order to extend this analysis to the gene level. We assign each probe to a gene, provided it falls within 10,000 bp up- or downstream of the transcription start or stop site. The SM scores for the probes within a gene are averaged and compared to the mean and standard deviation of a background distribution, which was calculated by sampling the top 5th percentile of amplified or deleted probes from all genes on the array with the same number of probes as the gene in question. The result is a 𝑍-score for each gene in the genome that is represented on the array.

2.9. Metastatic Potential Score and Survival Analysis

We developed an algorithm based on genomic CNAs to calculate a metastatic potential score (MPS), with a higher score indicating a greater likelihood of metastasis. The MPS score for a new individual patient only depends on the CNA profile of this new patient. It can be calculated without requirement for other samples, since it’s simply based on the concordance/discordance relationship to the CNA metastasis gene signature previously identified as selecting for the metastatic phenotype through our selection model. The MPS was calculated using a weighted 𝑍 score from the top set of CNAs overlapping metastasis genes determined by the significance of their selection model 𝑍 scores. We used 𝑍β‰₯1.7 as a cutoff point because for standard normal distribution, the tail of 1.7 is about 5%. The metastatic potential score was defined as the following: MPS=𝑛𝑖=1π‘ξ…žπ‘–βˆ—Dirsig(𝑖)βˆ—Dirsamp(𝑖).(2.6)

For each tumor profile, logistic adjusted 𝑍 scores (π‘ξ…ž) from genes (𝑖…𝑛) that match the direction of the metastasis gene signature (a vector of βˆ’1s and +1s representing whether the gene was deleted or amplified in the signature, resp.) were added, whereas π‘ξ…ž from genes that mismatch the direction of the signature were subtracted. As the direction component of the risk model score (Dir) reflects, if the CNAs of the metastasis signature (Dirsig) and the unknown sample profile (Dirsamp) are in the same direction, the coefficient will be 1; if they are in opposing directions, the coefficient will be βˆ’1; and if Dirsamp(𝑖) = 0, then the entire term will not count towards the score. For example, if a gene 𝑖, that is typically amplified in metastases (Dirsig(𝑖)) and mPTs, is also amplified in the unknown profile (Dirsamp(𝑖)) that 𝑍 score is added, whereas if gene 𝑖 in the profile is deleted, as expected in iPTs, the 𝑍 score is subtracted. Neutral genes that are neither amplified nor deleted in the unknown profile are not scored in this model.

Three metastasis signatures, derived from a combination of five cohorts were used to develop the MPS. The first signature was identified using 49 primary tumors of unknown clinical outcome from NYU (𝑛=29) and Baylor (𝑛=20) and a metastasis cohort from Hopkins (𝑛=13). The other two signatures were identified using 75% of the MSK cohort of primary tumors of unknown outcome (𝑛=126) along with a set of metastatic tumors (𝑛=36) from the same MSK cohort. The CNA-based gene signatures from these 2 sets of cohorts were concatenated and derived into the MPS which we assessed in a Cox proportional hazard model with samples set aside for testing purposes only. The test cases were comprised of bona fide mPTs (primary tumors that later developed into distant metastasis), whereas the test controls were derived from a random sample of tumors with unknown outcome not used to build the MPS. All presurgery predictors (PSA, clinical stage, biopsy Gleason) and other demographic variables (age at diagnosis and race) were tested independently and in combination with the MPS in Cox proportional hazards survival analysis with the time variable represented by progression to metastasis.

3. Results

3.1. Prediction Models

Our selection models resulted in three hundred and sixty-eight genes (from 3 metastasis signatures) with a CNA status that was concordant among METS and mPTs and contrasted with iPTs (𝑍β‰₯1.7) (Supplemental Table 1, see Supplementary Materials available online at http://dx.doi.org/10.1155/2012/873570). With these genes, we developed the MPS and tested the accuracy as an independent predictor of metastasis, with a subset of primary tumors (𝑛=52) not used to develop the signatures (𝑛=13 mPTs and 𝑛=39 control primary tumors, Table 1). As a continuous predictor, applying the MPS to a Cox proportional hazards model resulted in a significant association to the endpoint of metastasis-free survival (2.88; 95%  CI=1.15βˆ’7.2;𝑃=0.02) (Table 2).

Patients diagnosed with prostate cancer have several pretreatment variables, such as, clinical stage (combination of digital rectum exam, PSA, and ultrasound/MRI), biopsy Gleason score and other demographic measures (e.g., age or race) to guide the decision to undergo surgery. These variables have marginal clinical utility and, in our cohorts, none of these clinical variables were statistically significant in univariate or multivariate logistic regression models. In multivariate Cox regression models (Table 2), only the MPS score reached statistical significance, indicating, that the MPS score was the only reproducible predictor of metastasis-free survival.

Notably, the clinical stage was specific when palpable tumor was detected (T2 or greater); however, it lacked sensitivity, because 47% (9/19) of pathological stage-4 cases that evaluated ex-vivo were diagnosed as T1C before surgery [7]. Twenty-seven percent (13 out of 49) of clinical stage T1C tumors that were upstaged following prostatectomy resulted in distant metastasis formation. Therefore, staging at the time of biopsy can seriously underestimate the severity of disease. Similarly, the biopsy Gleason score versus the postsurgery Gleason score was underestimated in 38% of cases and overestimated in 8% [7] (Figure 8).

3.2. Metastatic Potential Score Distributions

Significant differences as measured by Mann-Whitney test of the MPS were observed for the metastasis (𝑃<0.001) and mPT (𝑃=0.001) groups, compared to the control primary tumors (Figure 9). The MPS in the lymph-node-positive primary tumors (derived from the MSK (𝑛=9) and Stanford (𝑛=9) cohorts did not differ significantly from the control tumor group (𝑃MSK=0.34,𝑃Stanford=0.13,𝑃Combined=0.08), which reflected the marginal ability of this clinical parameter to predict distant metastasis in previous reports [15].

Consistent with our assumption that the control cohorts contained a fraction of mPTs, their MPS overlapped the MPS range of the cases. Furthermore, control primary tumors (from MSK cohort) that did not recur biochemically (as measured by PSA) after 80 months of followup, (represented by green Xs in Figure 9) were not significantly correlated with the MPS. To determine whether other cancer types exhibited a similar metastatic landscape of CNAs to that observed in prostate cancer, we calculated the metastatic potential score for 337 cancer cell lines. We observed an overall distribution that overlapped with low-risk prostate primary tumors (Figure 9). However, 22 of the 337 cell lines ranked by MPS were above the 75th percentile of the prostate primary tumors and metastases. These cell lines originated from tumors of the lung (𝑛=10), breast (𝑛=3), colon (𝑛=2), and melanoma (𝑛=2). Other singletons in this group of 22 cell lines originated from thyroid, rectum, pharynx, pancreas, and kidney.

3.3. Biomarker Functional Significance

Another way to validate our algorithms is by data mining the functional attributes of the metastasis genes identified by the selection model. As expected, many of the top-ranking metastasis genes identified have molecular functions related to alteration of nuclear and extracellular matrix structure and metabolic modification that enhance processes characteristic of escape from anoikis (a key metastasis specific process). A heat map of the CNA events of signature genes for all prostate tumors is suggestive of a path toward the different high frequency amplification versus deletion events that contrast the high-risk and low-risk tumors (Figure 10). The mid-risk region with its relative paucity of signature events may represent the starting point of two alternative pathways of subsequent copy number alteration, one leading to metastasis and the other to an indolent state. The locking in of these β€œantimetastasis” events in indolent tumors may explain why they failed to metastasize despite extended periods of watchful waiting.

One of the top predictor genes, the solute carrier family SLC7A5 gene, deleted on chromosome 16q24.2, encodes a neutral aminoacid transporter protein (LAT1) that has been implicated in multiple cancers (prostate [16], breast [17], ovarian [18], lung [19], and brain [20]) and has been shown to have utility as a diagnostic [21–23] and drug target in cell line [24–26] and preclinical animal models [27]. The normal function of LAT1 is to regulate cellular aminoacid concentration, L-glutamine (efflux) and L-leucine (influx). Reduced activity of LAT1 results in increased concentrations of L-glutamine which has been shown to constitutively fuel mTOR activity [28]. Seven other solute carrier superfamily members (SLCO5A1, SLC7A2, SLC10A5, SLC26A7, SLC25A37, SLC38A8, and SLC39A14) were predictive of metastatic potential in our models, likely creating a cellular environment conducive to metastasis.

A second subset of signature genes included 6 Cadherin family members encoding calcium dependent cell adhesion glycoproteins (CDH2, CDH8, CDH13, CDH15, CDH17, and PCDH9). Many of the Cadherin family proteins have putative functions associated with metastasis progression [29] and have been included in diagnostic panels [30, 31].

A third subset of 5 genes predicted to contribute to metastatic potential were potassium channels, KCNB2, KCNQ3, KCNAB1, KCTD8, and KCNH4. Notably, 3 other potassium channels reside in the highly amplified region between 8q13 and 8q24 (KCNS2, KCNV1 and KCNK9) that did not rank high in our analysis but may have weak or modifier effects. High levels of cytoplasmic potassium ion concentrations have been shown to inhibit the hallmark mitochondrial apoptotic cascade of membrane disruption and ensuing release of cytochrome C, caspase, and nuclease degradation of cellular components [32]. Furthermore, another study showed that the methylation status of potassium channel, KCNMA1 (10q22.3), was predictive of prostate cancer recurrence [33]. The activity of voltage-gated potassium channels in prostate cancer cell lines, LNCaP (low metastatic potential) and PC3 (high metastatic potential), were observed to be markedly different [34]. The complete set of metastasis signature genes likely represents various subsets of functions. Representation of different gene family members suggests that each tumor may have a unique profile to progress to metastasis, yet different members of a gene family may contribute to a functional redundancy. Notably, the genomic DNA landscape around the androgen receptor locus on chromosome X represents a compelling observation linking CNAs to a functional cause and effect response of androgen ablation therapy (Figure 5).

4. Summary

In this study, we developed a semisupervised clustering algorithm that can infer the classification of a primary tumor based on metastatic risk. This was essential to overcome the limitations inherent to prostate cancer cohorts for collecting long-term clinical outcome data. Our novel approach to modeling the CNA data based on Darwin’s evolutionary selection theory allowed us to identify genes associated with the specific metastatic processes of anoikis. Current clinical models for assessing risk are aimed at predicting biochemical recurrence, rather than metastasis, and do not include genomic information. This limitation was underscored in a study with a large cohort of greater than 10,000 men who had undergone radical prostatectomy [35]. Within that cohort, about 20% of men developed biochemical recurrence within 5 years of the procedure, but subsequently only 10% of the men with biochemical recurrence developed distant metastases after 12 years.

This proposed new classification method and selection model allowed us to develop a metastatic potential score that could be used for predicting an individual’s metastasis-free survival at the time of diagnosis. With validation in additional cohorts and statistical models with known metastasis outcome, this approach may lead to a significant advancement in determining whether aggressive treatment of prostate cancer is necessary. This predictor might be important for correctly categorizing men at the time of diagnosis and could predict whether surgery, radiation therapy, or watchful waiting was warranted. Because the proposed tool, tumor genomic analysis, is comprehensive for identifying the genetic changes that are associated with the pathogenesis of metastasis, there is a greater likelihood of selecting a sufficient number of markers that are both sensitive and specific predictors. This method could be applied to other cancers (e.g., breast) that exhibit variation in the metastatic potential of the primary tumor and have similar difficulties in collecting tumor samples with long-term clinical outcome data.


The authors would like to thank Dr. Kelly Maxwell for extracting the genomic DNA for the NYU cohort and all of the reviewers for their thoughtful suggestions.

Supplementary Materials

A table of the metastasis gene signature is provided with chromosomal coordinates. Each gene has an associated Z-score (_Z) and the direction of the alteration (_dir) amplification (+1) or deletion (–1) is represented for each of the three signatures (NYU, MSKs1 and MSKs2).

  1. Supplementary Table