Abstract

Genomic copy number variations are a typical feature of cancer. These variations may influence cancer outcomes as well as effectiveness of treatment. There are many computational methods developed to detect regions with deletions and amplifications without estimating actual copy numbers (CN) in these regions. We have developed a computational method capable of detecting regions with deletions and amplifications as well as estimating actual copy numbers in these regions. The method is based on determining how signal intensity from different probes is related to CN, taking into account changes in the total genome size, and incorporating into analysis contamination of the solid tumors with benign tissue. Hidden Markov Model is used to obtain the most likely CN solution. The method has been implemented for Affymetrix 500K GeneChip arrays and Agilent 244K oligonucleotide arrays. The results of CN analysis for normal cell lines, cancer cell lines, and tumor samples are presented. The method is capable of detecting copy number alterations in tumor samples with up to 80% contamination with benign tissue. Analysis of 178 cancer cell lines reveals multiple regions of common homozygous deletions and strong amplifications encompassing known tumor suppressor genes and oncogenes as well as novel cancer related genes.

1. Introduction

Genomes of cancer cells are known to harbor multiple regions of deletions and amplifications. Many of these CN alterations are probably random but some accelerate cell growth and suppress apoptosis. For example, homozygous deletions lead to inactivation of such tumor suppressor genes as RB1 [1], p16 [2], and PTEN [3], while amplifications lead to activation of such oncogenes as MYC [4], EGFR [5], and ERBB2 [6]. Moreover, it has been shown that heterogeneity of cancer outcomes [7] and sensitivity to chemotherapy can be at least, in part, explained by CN changes in primary tumor [8, 9]. Thus it is of great importance to identify CN variations frequent for a specific tumor type or present in a specific sample. Array comparative genomic hybridization is a method designed to detect such genomic region alterations [10]. The development of high-density arrays as well as advances in data analysis have greatly improved our ability to determine CN regions [11, 12]. Different methods of analysis have been reviewed by Chari et al. [13].

Many of these computational methods detect regions with deletions and amplifications without estimating the actual CN in these regions. For example, these methods do not clearly distinguish between homozygous and heterozygous deletions which may have significantly different effects on cancer development; losing just one of two copies of a gene should not have as drastic of an effect as losing both copies. In fact, heterozygous deletions are very common and affect large regions including chromosomal arms and whole chromosomes. In contrast, homozygous deletions are much less common and usually affect small regions with only a few genes deleted. Similarly, it is important to be able to distinguish between low and high copy amplifications. Due to their overall genomic instability, cancer cells might accumulate multiple random CN changes which do not contribute to cancer development. Accordingly, one should expect that only regions with very high amplification, 10 to 100 copies, are regions that are likely to encompass oncogenes such as MYC, EGFR, and ERBB2.

In this study we attempted to develop a method to estimate CN values for normal samples, cell lines, and solid tumors. To be able to do this we had to determine how signal intensity from different probes is related to CN, how to take into account change of the genome size in cancer cell lines and solid tumors which result from somatic alterations, and finally how to incorporate into the analysis contamination of the solid tumors with benign tissue. The last factor is of particular importance as solid tumor samples are strongly contaminated with noncancerous cells carrying DNA without somatic alterations. Existing dissection techniques can reduce this contamination but cannot eliminate it entirely. For example, it is practically impossible to eliminate immune cell infiltration of tumors. In our experience, even ovarian tumors which are considered less affected by noncancerous contamination than other tumors often contain more than 50% normal, noncancerous, DNA. Such strong contamination presents a significant problem for estimating CN. For example, if in a tumor sample a particular region appears to have one copy, the actual CN in cancerous cells can be either one () if the sample has negligible contamination or zero () if the contamination is about 50%.

2. Materials and Methods

2.1. Genomic DNA

Frozen tumors were cut into 10 m sections and macrodissected to minimize contamination with normal tissue. A QIAamp DNA Mini Kit (QIAgen) was used to isolate the DNA as per the manufacturer’s protocol with an overnight lysis incubation at , including the optional RNase A treatment. DNA was quantitated using a Nanodrop spectrophotometer and picogreen.

2.2. Affymetrix 500K GeneChip Arrays

The Affymetrix GeneChip Mapping NspI or StyI Assay Kit was used in the generation of biotinylated DNA for Affymetrix Mapping 500K NspI or StyI microarray hybridizations (each assay was prepared separately). Genomic DNA (250 ng) was digested with NspI or StyI restriction enzyme and adaptors were added to restriction fragment ends with T4 DNA ligase. Adaptor-modified samples were PCR amplified using Clontech Titanium Taq which generated an amplified product of average size between 200 and 1,100 bp. Amplification products were purified using a Clontech DNA amplification cleanup kit. 90 g of purified DNA was fragmented using Affymetrix Fragmentation Reagent. Biotin-labeling of the fragmented sample was accomplished using the GeneChip DNA Labeling Reagent. Biotin-labeled DNA was hybridized on NspI or StyI Affymetrix microarrays at for 16 to 18 hours in the Affymetrix rotation oven. After hybridization, probe array wash and stain procedures were carried out on the automatic Affymetrix Fluidics Stations as per manufacturer’s manual and microarrays were scanned and raw data were collected by Affymetrix GeneChip Scanner 3000.

2.3. Quantitative Real-Time PCR

12 cell lines were identified with copy number amplifications detected by microarray spanning the genes CCND1, EGFR, and/or ERBB2. Custom qPCR assays (Applied Biosystems) were designed which would amplify a region of genomic DNA from each of these genes, and from 2 genomic regions where the cell lines did not show any copy number changes. These 2 assays were used as calibrators to calculate dCt in subsequent analyses. Genomic DNA from each of the cell lines was diluted to 5 ng/l. 1 L genomic DNA, 8 l H2O, 1 l qPCR assay, and 10 l TaqMan Universal PCR Master Mix (Applied Biosystems) were combined and cycled in an Applied Biosystems 7900 real-time PCR instrument using the following conditions: for 2 minutes, for 10 minutes, 40 cycles of for 15 seconds, for 1 minute, hold at .

2.4. Agilent 244K CGH Arrays

0.5–3 g test and reference (Promega, p/n G152A) genomic DNA samples were simultaneously digested with AluI and RsaI restriction enzymes. Following fragmentation, the DNA samples were labeled using an Agilent Genomic Enzymatic Labeling Kit. The labeling kit uses random primers and the exo-Klenow fragment to label DNA through incorporation of fluorescently labeled nucleotides (Cy3-dUTP or Cy5-dUTP, for test or reference DNA, resp.). Labeled samples were purified by Microcon YM-30 columns or AutoScreen-96A Well plates. The concentration of the purified samples was determined using a NanoDrop ND-1000 spectrophotometer.

Equal amounts of test and reference fluorescently labeled samples were pooled and heat denatured after being combined with cot-1 DNA, Agilent aCGH blocking agent, and Agilent Hi-RPM hybridization solution. Microarray hybridizations were performed using Agilent SureHyb Hybridization chambers. Hybridization chambers were loaded onto a rotisserie in an Agilent Hybridization Oven and were incubated at for 40 hours with a rotational speed of 20 rpm. Following incubation, the microarray slide was washed for 5 minutes in aCGH/ChIP-on-chip Agilent Wash Buffer 1 at room temperature and 1 minute in Agilent Wash Buffer 2 at . Microarray slides were scanned on an Agilent scanner and raw data were collected.

3. Results

3.1. Combining Signals from Multiple Probes on Affymetrix 500K GeneChip Arrays

The Affymetrix 500k GeneChip array contains 25-mer oligonucleotides distributed over two subarrays, Nsp1 and Sty1 containing 262,264 and 238,304 SNPs, respectively. Each SNP on the array is represented by either six or ten oligonucleotide probe quartets consisting of perfect-match (PM) and mismatch (MM) pairs for both SNP alleles. These probes reside on Nsp and Sty PCR amplicons, which range in size from 100 bp to 1143 bp. For CN analysis the signal intensities from the multiple probes representing a SNP should be optimally combined to generate a single value corresponding to the SNP.

It has been shown [14] that GC content and length of PCR amplicons affect signal intensities from probes and this effect varies from sample to sample. It also appears that the effect of amplicon size is much stronger than the effect of GC content. Our data (not shown) support this conclusion. Therefore, we did not take GC content into account. However, we made an adjustment for amplicon size. First for each quartet and allele, the intensity from the MM probe is subtracted from the intensity from the PM probe: . Then for a given amplicon size s, an average ∆Is over all quartets and alleles for SNPs residing on PCR amplicons of size s within a subarray is calculated. Finally, a normalized intensity is calculated for each allele X and each quartet q where s is the size of the corresponding amplicon.

The next step is to combine signals from both alleles A and B within a quartet q using the following formula: where and are normalization parameters. These parameters are introduced to ensure that is close to 2 independent of the genotype (AA, AB, or BB) of a sample. The ratio of these parameters, /, represents uneven allele amplification described previously for SNP arrays [15]. In order to estimate parameters and , we have used genotyping data for 48 cell lines provided by Affymetrix. Because these cell lines were collected from noncancerous tissue, we assume that for almost all SNPs copy number will be equal to 2. Therefore, for each SNP we sought to minimize the following expression: Here is the number of samples with genotype for this SNP among 48 samples (), while is the median value of in samples with genotype and is the median value of in these samples. Minimizing the function over parameters and results in the following formulae:

where Using these formulae we calculated parameters and for all SNPs on Affymetrix 500K GeneChip arrays except for 10,841 SNPs for which all 48 cell lines were homozygous. These SNPs were excluded from the analysis. To evaluate the strength of the uneven allele amplification effect we calculated the median of . The median is about 1.40, meaning that for a typical SNP on Affymetrix 500K GeneChip arrays, signal intensity for one of the alleles is about 40% higher than the signal intensity for the other allele.

The values for all quartets for SNP i are linearly combined with weights reflecting their performance: . The weights are selected to minimize the variance of under the assumption that the deviations of from 2 are independent for different quartets. Thus, the optimal weights are determined by the following formula: where is the variance of within 48 cell lines. The estimated variance of is To evaluate the importance of optimization of weights for different quartets, we can compare with the corresponding variance resulting from using equal weights for all quartets. The estimate for the latter variance is , where is the number of quartets for SNP i. Median ratio of the variances, , over all SNPs is about 1.97 which means that for a typical SNP using optimal weights should reduce variance almost twofold.

For each SNP we have calculated defined as averaged over 48 cell lines. The median for all SNPs is 0.0376. We have excluded 1,832 SNPs with exceeding an arbitrary cutoff 0.25. The important exception of the described method concerns SNPs on the X chromosome (outside the pseudoautosomal region). For these SNPs parameters were determined using only cell lines from females.

Some of the PCR amplicons on the Affymetrix 500K GeneChip array contain more than one SNP. Any CN changes affecting such an amplicon should equally affect all SNPs within the amplicon. Therefore, signals for all SNPs within an amplicon are averaged and assigned to one of the SNPs while the other SNPs are excluded from further analysis. This reduces the number of SNPs by 82,189 leaving 405,706 SNPs for CN analysis.

3.2. Relationship between SNP Signal Intensity and CN

While normalized SNP signal intensities described in the previous section are supposed to generate an average signal of 2.0 for SNPs within regions of genome with , one should not expect that the average signal is equal to the actual CN in the regions of the genome where CN is different from two. For example, the slope of the relationship between signal intensity from an array and the amount of DNA in the solution differs from the ideal value 1.0 [16]. Moreover, this relationship might be nonlinear as well. To evaluate this effect for regions of the genome with we used SNPs on the X chromosome (outside the pseudoautosomal regions) for 28 cell lines from males provided by Affymetrix. In addition, we have analyzed cell lines containing different numbers of X chromosomes (all karyotypes in the paper are described accordingly to ISCN 2009 nomenclature [17]): NA04626 (47,XXX), NA01416E (48,XXXX), and NA06061 (49,XXXXX). This allowed us to estimate the relationship between CN and signal intensity up to . Finally, we have selected a few cancer cell lines for which it has been reported that certain genes are strongly amplified. These cell lines and a reference sample RS were analyzed by qPCR assays amplifying the corresponding genes as well as a reference gene RG which has 2 copies in these cell lines. CNs for the amplified genes were estimated using the formula , where is the value for gene x for an amplified gene and for the reference gene) in sample y ( for a cell line and for the reference sample). The same cell lines were run on Affymetrix 500K GeneChip arrays. Comparison of the results allowed us to estimate the relationship between CN and signal intensity for high CN values. Results of the experiments are shown in Table 1 and Figure 1. Signal intensity for SNPs in regions with increases roughly linearly with CN. Therefore, for CN the expected signal intensity is given by Table 1 for and for .

3.3. Adjustment for Average CN and Contamination with Normal DNA

As described above, one of the steps in deriving involves normalization by average intensity of other SNPs. This normalization allows accurate estimation of CN for noncancerous samples where most of the genome has two copies. In cancerous samples the average CN, due to amplifications and deletions, can be very different from two and, therefore, additional signal normalization is required. In addition, tumor samples are often contaminated with noncancerous cells with most of the genome having two copies. As a result, the expected signal intensity for an SNP within regions with CN c is defined by the following equation: Here is a normalization factor to account for change in average CN within the genome of cancerous cells, and is the degree of contamination of tumor samples with normal cells. This equation is applicable to all types of samples: normal noncancerous cells, cancer cell lines, and tumors. However, for cancer cell lines we assume no contamination with noncancerous cells and, therefore, . For noncancerous samples we assume that the average CN is equal to 2.0 and, therefore, and .

In order to estimate the parameters and , we use the following algorithm. First, for each SNP i we calculate smoothed signal : where the sum is taken over SNPs surrounding SNP . Then we calculate the histogram of the smoothed signals with bin size of 0.1. The histogram is expected to have local maxima corresponding to CN values present in a given sample. Let be the total number of the local maxima and let be the positions of the maxima (). For a given pair of parameters and , for each maximum k we assign an integer CN value which minimizes the absolute difference where is given by (7). Then we calculate the following sum: This sum measures how well a given pair of parameters and fits the positions of all the maxima . In order to find the best parameters and , we minimize by direct enumeration of all pairs of and with within the interval (0,1) and within the interval (0.5,2). Unfortunately, this minimization is not always able to distinguish between the correct solution and incorrect ones. In order to avoid incorrect solutions, before optimization over parameters and we assign CN to some of the maxima according to the following rules. (1)If there is only one maximum (), then .(2)If and the first maximum has SNP genotyping call rate below 60%, then .(3)If and the first maximum has SNP heterozygosity below 3%, then .(4)If , and the second maximum has SNP heterozygosity below 3%, then .(5)If and the highest maximum is at least two times higher than the second highest maximum, then the highest maximum is assigned .

Rule () is based on the assumption that the SNP genotyping call rate should be low in the regions with . Rules () and () are based on the assumption that SNP heterozygosity should be low in the regions with . Both assumptions are valid for cell lines and tumors with low contamination with noncancerous DNA. Rules () and () are based on the assumption that the dominating CN equals 2.

3.4. Hidden Markov Model

The likelihood that SNPs have copy numbers can be written as Here the first product is taken over all SNPs, and the normal distribution is assumed for deviation of the natural logarithm of the actual SNP signal forms the natural logarithm of the expected signal . In addition, the deviations are assumed to be independent and to have the same standard deviation . The second product is taken over all pairs of adjacent SNPs, if and otherwise, and is the probability that the adjacent SNPs have different CNs. The most likely CN solution is the one that maximizes the likelihood . The most likely solution depends both on and or, more precisely, on the value of .

The likelihood defines Hidden Markov Model (HMM) with the states being CNs of individual SNPs [18, 19]. Therefore one can use forward-backward procedure [19] to find the maximum likelihood state. In order to make the model computationally tractable, we limit the largest possible CN to be 70. Another simplification is related to the fact that individual chromosomes are independent and, therefore, the maximization of the likelihood is performed separately for all chromosomes. First, we find the maximum likelihood state using values for and   obtained by minimization of the expression (9). Then we minimize over and the following sum: Then we find the maximum likelihood state using values for and obtained by minimization of the expression (11). We repeat this iteration procedure until it converges as defined by change in upon one iteration being less than 3% and change in being less than 0.1%. Formula (11) also provides the estimate of the variance in expression (10).

3.5. Analysis of 48 Normal Cell Lines

First we have analyzed the data for 48 normal cell lines provided by Affymetrix. One of the advantages of this analysis is that the samples were collected from families (triplets), allowing us to test our algorithm; germ line deletions and amplifications should be inherited from parents. In addition, CN changes are expected to be relatively uncommon. Using this dataset we have optimized parameter in formula (10) to reduce the number of false positive and false negative CN changes: . We have found multiple amplifications with CN up to five and heterozygous deletions () as well as a few very small homozygous deletions. These observations are consistent with published results [20]. As expected, we have found for X chromosome for all males. Moreover, for all males we have observed for the pseudoautosomal regions.

Results of the analysis of one of the triplets (NA12056 from the father, NA12057 from the mother, and NA10851 from the son) are shown on Figure 2. All deletions and amplifications in NA10851 are, as expected, inherited from the parents: heterozygous deletion on chromosome 4 from the father, heterozygous deletion on chromosome 7 from the mother, amplification to on chromosome 10 from the father, heterozygous deletion on chromosome 11 from the mother (who actually has a homozygous deletion of this region), and heterozygous deletion on chromosome 13 from the father.

Similar analysis of the other triplets revealed some incompatibilities between CN alterations in children and their parents. Some of these incompatibilities are caused by true somatic CN alterations in the samples as was reported previously [20]. Other incompatibilities are caused by the algorithm either missing some real alterations (false negatives) or reporting some unreal alterations (false positives). We found that the rate of false alterations is high for alterations encompassing only a few SNPs. Starting from seven SNPs the rate of false CN alterations is below 5%.

3.6. Analysis of 178 Cancer Cell Lines

As an example we present here the analysis of the ovarian cancer cell line OVCAR8. Smoothed signals are shown in Figure 3(a) with the corresponding histogram shown in Figure 3(b). There are seven maxima () in the histogram. Table 2 lists call rates and heterozygosities for SNPs within the seven maxima as well as heights of the maxima. SNPs within the first maximum have low call rate and thus are assigned . SNPs within the second peak have low heterogeneity and thus are likely to correspond to . The third peak has to correspond to , which is supported by the fact that this peak is the highest. Since this is a cell line, we assume that . Minimization of over gives . After adjustment it becomes clear that the fourth peak corresponds to , the fifth to , the sixth to , and the seventh to . Running HMM refines the estimate for . The final result of CN analysis for this cell line is presented on Figure 3(c). Homozygous deletions of several genes are observed including EVI1 and WWOX.

We have analyzed 178 different cancer cell lines (44 breast, 35 colon, 19 brain, 14 ovarian, 14 lung, 13 melanoma, 11 leukemia, seven pancreatic, six bladder, three kidney, two uterus, two testicular, two prostate, two lymphoma, one thyroid, one salivary gland, one retina, and one plasmacytoma). The number of CN variations in these cell lines varied from two to 369 (the median value being 94). We have observed 510 homozygous deletions encompassing seven or more SNPs. Homozygous deletions, which were observed in at least five different cancer cell lines, are shown in Table 3. Some of these deletions encompass well-known tumor suppressor genes such as p16 and PTEN. Others might encompass unknown tumor suppressor genes, and some of these deletions might simply happen within fragile parts of the genome. We have also observed 580 amplifications with CN of at least 10. Those amplifications, which were observed in at least five different cancer cell lines, are shown in Table 4. Some of these amplifications encompass well-known oncogenes such as MYC and ERBB2. Others might encompass oncogenes which are not yet known.

3.7. Adjustment on Contamination with Benign Tissue in Tumor Samples

In order to check how well our method estimates the degree of contamination of tumors with benign tissue, we artificially contaminated DNA samples of eight cancer cell lines (see Table 5) from different tissues with DNA extracted from CEPH cell line NA12776. All cancer cell lines as well as NA12776 were sampled from female subjects. The DNA samples were quantitated by picogreen three times for higher precision and then combined to give degrees of contamination of the different cancer cell lines with NA12776 DNA between 10% and 80%. After that we ran these mixed samples on Affymetrix microarrays and estimated degrees of contamination using our algorithm. Results of this analysis as well as the mix ratio based on DNA quantitation are presented in Table 5. For the majority of samples the differences between the estimated and the DNA quantitation data were within a few percent. This is quite remarkable, because the observed difference is close to the error rate of the picogreen quantitation. As one could expect, the observed differences tend to be higher for higher contamination levels.

The cancer cell lines used in the mixing experiment were also run on Affymetrix microarrays without any mixing. Comparison between results of CN analysis of mixed versus nonmixed samples is presented in Table 5. The concordance between CN results is expressed as percent of SNPs with the same CN. With one exception the concordance is above 95% for contamination levels below 80%. While the sensitivity of CN analysis inevitably decreases as contamination with benign tissue increases (and small alterations in CN affecting only a few SNPs become indistinguishable from noise), our algorithm is able to determine the degree of contamination as well as major changes in copy number correctly in strongly contaminated samples.

Figure 4 demonstrates the importance of adjustment on the contamination with benign tissue for CN analysis of tumor samples. In Figure 4(a) one can see signal intensities of SNPs for a colon tumor sample. The contamination of this sample with benign tissue, determined by our program, is about 48%. Such strong contamination leads to a dramatic shift of levels of signal intensities for SNPs within regions with different CN values. The right CN solution after adjustment of signal intensity on contamination with benign tissue is presented on Figure 4(b). Because this sample was collected from a male subject, there is only one copy of X chromosome in both tumor and normal cells. As a result, despite contamination the signal intensity for SNPs within the X chromosome (outside the pseudoautosomal regions) is the same as it should be for SNPs within regions with in the absence of any contamination. However, one can see that the signal intensity of SNPs within large heterozygous deletions on chromosomes 1, 6, 10, and 18 is significantly higher than the signal intensity of SNPs within the X chromosome (see Figure 4(a)). SNPs within amplified regions on chromosomes 13, 15, 16, and 17 have signal intensity as if their CNs equal 3. However, after adjustment on contamination it becomes clear that these are actually amplifications to . Moreover, SNPs within homozygous deletions on chromosomes 10 and 17 because of the contamination appear to have the same signal intensity as SNPs within the X chromosome. In Figure 4(c) we presented the CN solution, obtained by the HMM algorithm, if adjustment on contamination with benign tissue is not made. All CN variations in this solution have been determined incorrectly (besides single copy of the X chromosome): heterozygous deletions on chromosomes 1 and 15 and amplification on chromosome 16 are lost altogether, positions for the heterozygous deletions on chromosomes 6, 10, 18, and 21 are wrong, homozygous deletions on chromosomes 10 (gene PTEN) and 17 are instead called heterozygous, and, finally, is assigned for the amplification on chromosome 15 instead of . Generally we have been able to determine CN values in solid tumors even when the degree of contamination with benign tissue is significant.

3.8. Using Agilent 244K Oligonucleotide Arrays for Copy Number Analysis

Agilent 244K oligonucleotide arrays with complete genome coverage are designed for copy number analysis [21, 22]. Instead of SNP probes the array has 60-mer oligonucleotide probes. Our method of analysis of Agilent array data is very similar to the described method of analysis of Affymetrix SNP array data with only two differences. First, we do not need to combine signals from individual probes into an SNP signal; rather we use signal intensities supplied by Agilent software. Second, when we estimate and in (9), we cannot use either SNP call rate or SNP heterozygosity.

4. Discussion

In this paper we have presented a new method of CN analysis of the data from Affymetrix 500K GeneChip arrays and Agilent 244K oligonucleotide arrays. The method is designed to determine positions of CN variations as well as to estimate actual CN. In most of the published algorithms, authors attempt to determine position of CN variations and distinguish between deletions and amplifications without determining the actual value of CN [19]. We reason that estimating actual CN is extremely important. For example, it is likely that most CN variations with , 3, and 4 are random in the sense that they are not positively selected for during cancer development. On the other hand, homozygous deletions and strong amplifications are most likely to be detected in the regions with genes related to cancer cell survival and growth.

In order to be able to estimate actual CN within amplified and deleted regions, we have determined how signal intensity from different probes is related to CN, how to take into account change of the genome size in cancer cell lines and solid tumors as the result of somatic alterations, and finally how to incorporate into the analysis contamination of the solid tumors with the benign tissue. HMM has been used to determine exact CN values as well as the positions of the affected regions.

We have tested our method on 48 normal cell lines derived from nuclear families. We have detected a number of germ line CN alterations with proper inheritance from parents to children. We also applied our algorithm to 178 cancer cell lines derived from different tissues. We identified multiple regions with common homozygous deletions and high-level amplifications. As expected some of these regions harbor well-known oncogenes and tumor suppressor genes. Other regions do not encompass well-known cancer-related genes. Then we applied our algorithm to artificial mixes of DNA derived from cancer cell lines with DNA derived from normal cell lines. These mixes model contamination of tumor samples with noncancerous cells. The estimates of the contamination of cancer DNA with normal DNA produced by our algorithm for these mixes matched closely the corresponding experimental estimates using DNA quantitation by picogreen. Comparison between CN alterations detected by our algorithm in the mixed samples versus pure samples showed very high concordance in particular for contamination levels below 50%. Finally, using as an example a colon cancer tumor sample with about 50% contamination with normal DNA, we have demonstrated how erroneous results of CN analysis can be without properly taking into account the effect of the contamination.

There are several limitations to the presented approach. To start with, both Affymetrix and Agilent arrays require a minimum input amount of 500 ng genomic DNA. DNA yields from processed tumor and cell culture samples are highly variable, but typically a minimum of five 10 micron slices of tissue or 107 cells is required to guarantee sufficient DNA yield. While this approach is capable of detecting CN alterations, it cannot detect other aberrations in cell karyotypes such as reciprocal translocations and CN neutral loss-of-heterogeneity regions. This algorithm also cannot distinguish between germline CN variations and somatic CN changes in cancer cell lines and tumors. To be able to do that one has to use normal patient DNA as control. An important limitation of the presented algorithm is related to the assumption that all cancer cells within a cell line or a tumor sample have the same CN profile. This assumption seems to be correct for most of the cell lines and tumor samples we have analyzed. However, some samples (10%) are clearly heterogeneous; not all the cells in these samples share all CN alterations. For such samples our method produces erroneous results within the regions not shared by all cancer cells. Fortunately, these regions represent only a fraction (5%) of the genome and, therefore, do not impact significantly the results for the rest of the genome. Another limitation of the algorithm is the assumption of independence of signal noise for closely positioned probes (see (10)). However, it is reasonable to expect that the noise for closely positioned probes should be correlated. We do observe such correlation but the correlation appears to be negligible for Affymetrix 500K SNP arrays and Agilent 244K oligonucleotide arrays. On the other hand, we observed a very strong correlation for adjacent probes on custom Agilent arrays with extremely dense coverage of specific regions (data not shown). This correlation strongly affects the performance of our algorithm and, most likely, other published algorithms; in particular, multiple false CN alterations are being observed. It would be advantageous to modify the presented algorithm to take properly into account this correlation for arrays with dense coverage.