Objective. To use cancer cases from registries and from the PMSI claims database to estimate the Département-specific incidence of four major cancers. Methods. Cases were extracted using the principal diagnosis alone, then the principal diagnosis combined with surgery codes. The ratios of PMSI cases to registry cases for 2004 were modelled; Département-specific incidence for 2007 was then estimated from these ratios and the 2007 PMSI cases. Results. For 2007, infranational incidence estimations were satisfactorily validated for colon-rectum and breast cancers only, not for ovary and kidney cancers. For breast cancer, the estimated national incidence was 50,578 cases and the incidence rate 98.6 cases per 100,000 person-years. For colon-rectum cancer, incidence was 21,172 in men versus 18,327 in women and the incidence rate 38 per 100,000 versus 24.8. For ovary cancer, the estimated incidence was 4,637 and the rate 8.6 per 100,000. For kidney cancer, incidence was 6,775 in men versus 3,273 in women and the rate 13.3 per 100,000 versus 5.2. Conclusion. Incidence estimation using PMSI patient identifiers proved encouraging, though it still depends on the assumption of uniform cancer treatments and coding.

1. Introduction

Responsible for about 350,000 new cases and 150,000 deaths each year in France, cancer is a major health problem, and its surveillance is a foremost public health concern. FRANCIM, the French network of cancer registries, collects cancer cases exhaustively in 10 to 14 French Départements (depending on the cancer type), which account for 15% to 20% of the French population. However, estimating epidemiological indicators for every Département in the country is necessary not only to reveal etiological factors and geographical or social disparities but also to plan medical resources (prevention, treatment, and surveillance).

Over the past ten years or so, FRANCIM and the Department of Biostatistics of the Hospices Civils de Lyon have been providing national estimations of cancer incidence [1, 2]. Their usual approach to produce these estimations is to use registry incidence data together with CépiDC mortality data (Centre d’Epidémiologie sur les Causes médicales de Décès) [3]. The principle is to calculate a mean ratio between incidence and mortality in an area covered by cancer registries, then use that ratio with national mortality data to derive an estimation of national incidence. While the mean ratio estimated in a registry area can reasonably be considered representative of the ratio for the whole country, the same is not true at the level of a single Département because this ratio may be highly variable between Départements and because identical incidence values do not necessarily lead to identical mortalities. Indeed, a great number of factors can affect patient survival and generate heterogeneity of the ratio between Départements: differences in patient management (diagnostic or therapeutic procedures), prevention or screening policies, or compliance of the population with these policies. Therefore, because incidence and mortality cannot be used to provide Département-specific incidence estimations, a new approach is needed.

One interesting source of data with nationwide coverage has recently been used together with registry data to estimate Département-specific cancer incidence: the hospital database of the Programme de Médicalisation des Systèmes d'Information Médicale (PMSI) [4–6]. This medicoadministrative database is maintained to help manage health institutions and provide budget estimates.

A previous work [7] discussed the problem of using hospital stays from PMSI data to estimate Département-specific incidence of breast cancer. However, the constant improvement in the quality of patient identification in PMSI data, through a single-patient identifier, now makes it possible to use patient-specific rather than stay-specific data.

In the present paper, our objective was to estimate Département-specific incidence of colon-rectum, breast, kidney, and ovary cancers for 2007 using mean ratios of PMSI-extracted cases to registry-extracted (incident) cases.

2. Materials and Methods

2.1. PMSI Database, Hospital Stay-Specific Data, and Patient-Specific Data

The French Agence Technique de l'Information sur l'Hospitalisation (ATIH) made available its data on all short stays in all health institutions across France for 2002–2007. In our analyses, we kept the variables related to personal characteristics (sex, age, and code of the residence area) and to the hospital stay (stay number, principal diagnosis according to the International Classification of Diseases (ICD-10) [8], and medical procedures according to the Catalogue des Actes Médicaux (CdAM) until 2004 [9] and to the new Classification Commune des Actes Médicaux (CCAM) [10] from 2004 to 2007), plus an anonymous alphanumerical patient identifier [11, 12] that allows chaining the hospital stays of the same patient in successive institutions. That identifier has been systematically generated by the FOIN procedure (Fonction d’Occultation des Informations Nominatives) in all French health institutions since 2001. Hospital stays with no available patient identifier were excluded from the present analyses.

Two algorithms were used independently to extract the hospital stays to analyze. The first one (algorithm 1) extracted all stays with cancer as principal diagnosis. The corresponding ICD-10 codes were C18 to C21 for colon-rectum cancer, C50 for breast cancer, C56 and C57.0 to C57.4 for ovary cancer, and C64 to C66 plus C68 for kidney cancer. The second one (algorithm 2) extracted stays with cancer as principal diagnosis and with surgical procedures for cancer. Using CdAM and CCAM codes, the latter extraction considered 95 procedures for colon-rectum cancer, 31 for breast cancer, 114 for ovary cancer, and 44 for kidney cancer.

After each type of extraction, hospital stays were ordered by their serial numbers to spot the first stay of each patient, then only these stays were kept for analysis. These stays were then counted over each Département by age group: ten groups for colon-rectum cancer (15–44, 45–49, 50–54, …, 80–84, and ≥85 yrs), eleven groups for kidney cancer in women (15–39, 40–44, 45–49,…, 80–84, and ≥85 yrs), and thirteen groups for each of breast cancer, ovary cancer, and kidney cancer in men (15–29, 30–34, 35–39,…, 80–84, ≥85 yrs).
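The first-stay selection and the counting by Département and age group can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout, the identifiers, and the sample records are hypothetical, and the age groups are those given above for colon-rectum cancer.

```python
from collections import Counter

# Hypothetical stay records: (patient_id, stay_number, departement, age).
stays = [
    ("p1", 3, "38", 52),
    ("p1", 1, "38", 51),
    ("p2", 2, "69", 47),
    ("p3", 5, "38", 88),
    ("p3", 7, "38", 88),
]

# Lower bounds of the ten colon-rectum age groups: 15-44, 45-49, ..., 80-84, >=85.
LOWER_BOUNDS = [15, 45, 50, 55, 60, 65, 70, 75, 80, 85]


def age_group(age):
    """Return the label of the age group containing `age`, or None below 15."""
    if age < LOWER_BOUNDS[0]:
        return None
    idx = max(i for i, lo in enumerate(LOWER_BOUNDS) if age >= lo)
    if idx == len(LOWER_BOUNDS) - 1:
        return ">=85"
    return f"{LOWER_BOUNDS[idx]}-{LOWER_BOUNDS[idx + 1] - 1}"


def first_stays(records):
    """Keep, for each patient, the stay with the smallest stay number."""
    first = {}
    for rec in records:
        pid, stay_no = rec[0], rec[1]
        if pid not in first or stay_no < first[pid][1]:
            first[pid] = rec
    return list(first.values())


def count_cases(records):
    """Count first stays per (departement, age group)."""
    counts = Counter()
    for _pid, _stay_no, dep, age in first_stays(records):
        group = age_group(age)
        if group is not None:
            counts[(dep, group)] += 1
    return counts


counts = count_cases(stays)
```

Keeping one record per patient is what turns stay-specific data into patient-specific data; the remaining analysis only needs the resulting counts.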

2.2. Registry Incident Cancer Cases

The FRANCIM network made available the data on incident cancer cases registered in 2004 (the most recent checked data set when the present study was initiated).

Cancer sites were determined according to the International Classification of Diseases for Oncology, third version (ICD-O-3) [8] and corresponded to invasive tumors. These codes were: C18 to C21 for colon-rectum cancers, C50 for breast cancer, C56 and C57.0 to C57.4 for ovary cancer (excluding morphological codes 8442/3, 8451/3, 8461/3, 8462/3, 8472/3, and 8473/3), and C64 to C66 plus C68 for kidney cancer.

The cancer registries used were those of eleven Départements: Calvados, Côte d’Or, Doubs, Hérault, Isère, Loire-Atlantique, Bas-Rhin, Haut-Rhin, Saône-et-Loire, Somme, and Tarn. Incident cancer cases were counted by the same age groups as for hospital stays.

2.3. Modeling the Ratio of PMSI Cases to Incident Cases

Our approach was to model, as a function of age, the PMSI cases/incident cases ratio; that is, the ratio of the number of patients with hospital stays for cancer present in the PMSI database to the number of incident cases present in the registries. This ratio was obtained from Départements where the two sources of information exist (i.e., Départements with a registry). It was then applied to PMSI data of Départements without a registry in order to estimate cancer incidence in these areas. The ratio was obtained from registry and PMSI data for 2004; it was then applied to PMSI data for 2007 to derive Département-specific incidence for 2007.
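The calibration principle can be illustrated with a deliberately simplified sketch. It assumes a crude pooled ratio per age group, with no spline smoothing and no random effect, so it is not the model actually fitted in the study; the function names and all counts are hypothetical toy values.

```python
def pooled_ratio(pmsi_by_group, registry_by_group):
    """PMSI cases / incident cases per age group, pooled over registry areas."""
    return {
        group: pmsi_by_group[group] / registry_by_group[group]
        for group in registry_by_group
        if registry_by_group[group] > 0
    }


def estimate_incidence(pmsi_by_group, ratio_by_group):
    """Divide PMSI counts by the ratio to obtain estimated incident cases."""
    return {
        group: pmsi_by_group[group] / ratio_by_group[group]
        for group in pmsi_by_group
    }


# Toy 2004 counts pooled over Départements with a registry (hypothetical).
pmsi_2004 = {"45-49": 125, "50-54": 150}
registry_2004 = {"45-49": 100, "50-54": 100}
ratio = pooled_ratio(pmsi_2004, registry_2004)

# Toy 2007 PMSI counts for one Département without a registry (hypothetical).
pmsi_2007 = {"45-49": 50, "50-54": 60}
estimated = estimate_incidence(pmsi_2007, ratio)
estimated_total = sum(estimated.values())
```

With these toy numbers the ratios are 1.25 and 1.5, so 50 and 60 PMSI cases each translate into 40 estimated incident cases.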

The method adopted to model that ratio was detailed by Remontet et al. [7]. It is a calibration method in which incidence, as obtained from cancer registries, is considered the reference or “true” value whereas PMSI cases allow only an approximation of this value. More precisely, the ratio is modelled as a function of age (effect smoothed using a cubic regression spline) and Département (considered as a random effect).

Further, to estimate the PMSI cases/incident cases ratio, a data quality criterion was required: the chaining rate had to be greater than or equal to 95%. This chaining rate was defined as the ratio of the number of stays with a personal identifier to the total number of stays.
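The quality criterion above amounts to a simple proportion. A minimal sketch, assuming a missing identifier is represented as `None` or an empty string (a hypothetical encoding, as are the function names):

```python
def chaining_rate(stay_patient_ids):
    """Share of stays that carry a patient identifier (None or "" = missing)."""
    total = len(stay_patient_ids)
    if total == 0:
        raise ValueError("chaining rate is undefined without stays")
    linked = sum(1 for pid in stay_patient_ids if pid)
    return linked / total


def meets_quality_criterion(stay_patient_ids, threshold=0.95):
    """True when the chaining rate reaches the 95% threshold used in the text."""
    return chaining_rate(stay_patient_ids) >= threshold


# Toy example: 96 stays with an identifier, 4 without.
ids = ["id"] * 96 + [None] * 4
rate = chaining_rate(ids)
accepted = meets_quality_criterion(ids)
```

Applied per Département and per cancer site, such a check decides which registry areas enter the ratio model.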

The analysis was carried out for the four cancer sites in women but only for colon-rectum cancer and kidney cancer in men. Whenever the number of cancer cases was small, it was not possible to take into account between-Département variability in areas with registries. We then had to pool data from several Départements and calculate the PMSI cases/incident cases ratio per age group over the entire zone.

The overall national incidence was calculated by summing all Département estimations. For validation, we compared these national incidence values to FRANCIM values previously obtained by modeling the incidence/mortality ratio [1, 2, 13].

2.4. Validation of the Estimations

Validation was carried out in three steps. In step 1, the PMSI cases/incident cases ratio was calculated over all ages using each algorithm per cancer site-sex combination and Département. Indeed, Département-specific estimations stemming from this approach are invalid unless the ratio is homogeneous between Départements (always >1 or always <1). In step 2, the observed ratio for a given age group and Département was graphically compared to the modelled mean ratio over all Départements with registries (Figure 1). To be applicable to all Départements, the modelled mean ratio should not be subject to a “Département effect”. In the presence of this effect, the observed ratios per age group tend to be systematically higher or lower than the mean ratio, whereas in its absence the observed ratios are distributed around the mean ratio. For cancer sites with high incidence (breast, colon-rectum), step 3 was a cross-validation [7]: the incidence in a given Département with a registry was estimated from PMSI data together with the PMSI cases/incident cases ratio obtained from a model from which the data of that Département were excluded. A comparison between the number of observed cases and the number of predicted cases yields a Prediction Error (PE) [7]. Under the hypothesis H0 of a correct prediction, the PE follows a χ² distribution whose number of degrees of freedom is equal to the number of age groups. A 5% α-risk was adopted to set the critical value for rejecting H0. For cancers with low incidence (ovary, kidney), cross-validation could not be used because it was difficult to determine the statistical distribution of the PE, so only graphical validation was done. In addition, a comparison between the total numbers of observed and predicted cases was carried out (χ² with one degree of freedom). A relative error (RE) was calculated as the difference between the numbers of observed and predicted cases divided by the number of observed cases.
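The two error measures of step 3 can be sketched as below. The exact form of the PE is given in [7], not in the text above; this sketch assumes a Pearson-type statistic, the sum over age groups of (observed − predicted)² / predicted, which is an assumption. The RE follows the definition given above. All counts are toy values, and the critical value is the tabulated 5% point of a χ² distribution with 3 degrees of freedom (one per toy age group).

```python
def prediction_error(observed, predicted):
    """Assumed Pearson-type PE: sum over age groups of (O - P)^2 / P."""
    return sum((o - p) ** 2 / p for o, p in zip(observed, predicted))


def relative_error(observed, predicted):
    """RE as defined in the text: (total observed - total predicted) / total observed."""
    total_observed = sum(observed)
    return (total_observed - sum(predicted)) / total_observed


# Toy counts for three age groups in one left-out Département (hypothetical).
observed = [10, 20, 30]
predicted = [8.0, 25.0, 30.0]

pe = prediction_error(observed, predicted)
re = relative_error(observed, predicted)

# Tabulated 5% critical value of a chi-square distribution with 3 degrees of freedom.
CRITICAL_VALUE_DF3 = 7.815
reject_h0 = pe > CRITICAL_VALUE_DF3
```

Here the PE is 1.5, well below the critical value, so the prediction for this toy Département would not be rejected; the RE is −5%, meaning the model slightly overpredicts the total.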

The results were mapped as standardized incidence ratios (SIR) using software MAPINFO (version 7.0).
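The text does not spell out how the SIR was computed. A minimal sketch, assuming the usual indirect standardization in which expected cases are obtained by applying reference (for example national) age-specific rates to the Département population; the function names, rates, and populations are hypothetical toy values:

```python
def expected_cases(reference_rates, population):
    """Expected cases: reference age-specific rates applied to the local population."""
    return sum(reference_rates[group] * population[group] for group in population)


def standardized_incidence_ratio(estimated_cases, reference_rates, population):
    """SIR = estimated (or observed) cases / expected cases."""
    return estimated_cases / expected_cases(reference_rates, population)


# Toy reference rates (cases per person per year) and Département population.
reference_rates = {"45-49": 0.001, "50-54": 0.002}
population = {"45-49": 10_000, "50-54": 5_000}

expected = expected_cases(reference_rates, population)
sir = standardized_incidence_ratio(25, reference_rates, population)
```

With about 20 expected cases and 25 estimated cases, the SIR is about 1.25; values above 1 mark Départements with higher incidence than the reference.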

3. Results

In 2004, the PMSI database included 20,721,587 hospital stays, of which 718,044 had cancer as principal diagnosis. The chaining rate of those stays was 96.4%. In 2007, there were 21,201,102 stays, of which 721,823 had cancer as principal diagnosis, and the chaining rate was 99%.

To illustrate our validation procedure, the analyses relative to breast cancer are presented in full detail whereas those relative to the other cancer sites are presented more briefly.

3.1. Breast Cancer

Six registries were selected to estimate the incidence of breast cancer (Table 1). Four Départements with registries were excluded because, in 2004, their chaining rate was too low; it ranged between 49.7% and 94.9%.

Table 1 shows the total number of cases by cancer site and case-extraction algorithm as well as the three steps of estimate validation: homogeneity of the PMSI cases/incident cases ratio through Départements with ratio <1, the Département effect, and the cross-validation.

Regarding the homogeneity of the ratio, the results show that algorithm 2 was inadequate. Indeed, whereas with algorithm 1 the number of PMSI cases was always higher than the incident cases in all Départements, with algorithm 2, the former number was sometimes higher (2 Départements) and sometimes lower (4 Départements) than the latter.

Regarding the Département effect, Figure 1 presents, for each Département with a registry and for algorithm 1, the observed ratio by age group as well as the modelled mean ratio over all Départements with registries. The absence of heterogeneity in that ratio between Départements is illustrated by the fact that there was no Département in which the ratio per age group was systematically higher or lower than the modelled mean ratio (though the ratios for Calvados tended to be higher than the modelled mean ratio). It can therefore be concluded that there is no Département effect for breast cancer with algorithm 1 (the variance of the random effect was small: 0.023).

Regarding the cross-validation, Table 2 presents the detailed results for breast cancer. With algorithm 1, the differences between the observed and the predicted numbers of cases per age group (PEs) were small whereas, with algorithm 2, two Départements displayed large differences, especially in the last age group. Furthermore, the difference between the total observed and the total predicted cases (χ²) was small with algorithm 1; thus, algorithm 1 may be reliably used for breast cancer estimates.

Table 3 shows the national estimations obtained by summing the estimations from all Départements with algorithm 1 as well as the national estimations produced by FRANCIM [14] through the use of the incidence/mortality ratio [1, 2, 13]. Comparing these two estimations was one way to validate the results of the present analysis.

The national incidence of breast cancer for 2007 was estimated at 50,578 cases. The World age-standardized incidence rate (WASR) was 98.6 cases per 100,000 and varied between Départements from 71.4 to 127.1 (Table 4).
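A WASR of this kind is a directly age-standardized rate. A minimal sketch, assuming the usual direct standardization formula with caller-supplied standard weights; the weights, case counts, and person-years below are toy values, not the world standard population and not study data:

```python
def age_standardized_rate(cases, person_years, std_weights, per=100_000):
    """Weighted mean of age-specific rates, expressed per `per` person-years."""
    total_weight = sum(std_weights.values())
    weighted_sum = sum(
        std_weights[group] * cases[group] / person_years[group]
        for group in std_weights
    )
    return per * weighted_sum / total_weight


# Toy inputs for two age groups (hypothetical).
cases = {"45-49": 10, "50-54": 40}
person_years = {"45-49": 10_000, "50-54": 20_000}
std_weights = {"45-49": 3, "50-54": 1}

wasr = age_standardized_rate(cases, person_years, std_weights)
```

The toy age-specific rates are 100 and 200 per 100,000; weighting them 3 to 1 gives a standardized rate of about 125 per 100,000. Using the same weights for every Département is what makes the rates comparable despite different age structures.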

3.2. Colon-Rectum Cancer

To estimate Département-specific incidence of colon-rectum cancer, we used data from eight registries. Data from three Départements with registries were excluded because the chaining rate in 2004 was too poor (72.6% to 94.9%).

Irrespective of sex, the PMSI cases/incident cases ratios obtained with algorithm 1 were homogeneous (the variance of the random effect was small: 0.024), which was not the case with algorithm 2, for which there was a Département effect in both sexes. In addition, contrary to algorithm 1, cross-validation invalidated the estimations made with algorithm 2 in two Départements. Thus, we present only estimations made with algorithm 1. The national incidence for 2007 was estimated at 21,172 cases in men and 18,327 cases in women. The estimated national WASR was 38 cases per 100,000 in men and 24.8 in women. At the national level, our estimations were in close agreement with those of FRANCIM (40.8 cases per 100,000 in men and 24.8 cases per 100,000 in women).

Among Départements, the WASR ranged from 23.1 to 48.9 in men and from 17.4 to 31.3 in women.

3.3. Ovary Cancer

The estimations of Département-specific incidence used six cancer registries. Three Départements with registries were excluded because the chaining rate was too poor (80% to 93.5%).

With algorithm 1, the PMSI cases/incident cases ratio was homogeneous between Départements, and no Département effect was observed. This was not the case with algorithm 2, for which heterogeneity was observed. However, as already mentioned in the Methods section, cross-validation could not be carried out to validate the estimates from algorithm 1; thus, because formal validation of the Département-specific estimations was difficult, we present only national estimations based on algorithm 1. The national incidence was 4,637 cases, which corresponds to a WASR of 8.6 cases per 100,000.

3.4. Kidney Cancer

The quality of data chaining in Départements with registries differed according to sex. Thus, eight Département registries were considered for men (one registry excluded because of a 92.5% chaining rate) and five registries for women (four registries excluded because of chaining rates ranging between 89.7% and 94.4%).

Here too, only a graphical validation of the estimations could be carried out and, because Département-specific estimations were difficult to validate, only national estimations are given.

As for ovary cancer and irrespective of sex, algorithm 1 performed better than algorithm 2. In 2007, the national incidence of kidney cancer was estimated at 6,775 in men and 3,273 in women; the national WASR was 13.3 per 100,000 in men but much lower (5.2 per 100,000) in women.

3.5. SIR Maps

Département-specific estimations for colon-rectum and breast cancers are shown in Table 4, and SIR maps of these cancers are shown in Figures 2, 3, and 4. These maps were not constructed for ovary and kidney cancers because formal validation of Département-specific incidence was difficult.

Overall, no clear geographical gradient could be seen. However, for colon-rectum cancer in men, the southwest was a low incidence area. This area was much larger in women. One well-marked low-incidence area for breast cancer was the southwest quadrant of France.

4. Discussion

To estimate the incidence of the four cancers in each Département, case extraction from the PMSI database used two algorithms. Algorithm 1 targeted all hospitalized cancer patients; that is, those whose principal diagnosis is cancer because of positive laboratory tests, metastasis staging, cancer-related procedures, or sudden potentially fatal progression (exacerbation or relapse). Thus, some prevalent cases were included along with incident cases. This is common in incidence estimations based on hospital data [15–17] and was confirmed here: the number of cases found in PMSI data was higher than the number of incident cases found in the registries (more so with algorithm 1 than with algorithm 2). Nevertheless, the PMSI cases/incident cases ratio obtained with algorithm 1 seemed fairly stable between Départements, which allowed estimations of Département-specific incidence. In contrast, algorithm 2, which used initial surgical procedures, was more selective; it extracted a number of PMSI patients closer to the number of incident cases than algorithm 1 did. However, its PMSI cases/incident cases ratio was more heterogeneous between Départements (to different degrees according to the cancer site). Besides, cross-validation revealed that its incidence estimations in some Départements with registries were not valid: with algorithm 2, the critical value of the prediction error was exceeded for colon-rectum cancer in both sexes and for breast cancer in women. In sum, the simpler and less selective algorithm 1 was more adequate than algorithm 2 for estimating Département-specific incidence.

At the national level, our estimations were in agreement with FRANCIM projections [13], except for breast cancer (our estimations were lower). This was expected because estimations using incidence/mortality ratios do not take into account the recent trend towards a slow decline of breast cancer incidence in France [18] and in other countries [19, 20]. Our estimation for 2007 of 50,578 new yearly cases in France thus seemed more realistic than the 52,492 cases stemming from FRANCIM projections. This comparison is interesting because it shows that, on average over all Départements, our approach leads to reliable estimations. Finally, our graphical validation cannot constitute a formal validation because, in borderline situations, one cannot claim a “Département effect”. The graphical method serves only as a tool to detect important departures from the model assumptions.

PMSI data concern only hospitalized patients; thus, our method does not apply to cancers such as skin melanomas or basal cell carcinomas, which are usually treated early without hospitalization. Besides, the method assumes identical treatment choices in all Départements, which motivated the choice of the four cancer sites [21]. In two successive articles on colon and rectum cancers [22, 23], Phelip et al. have shown that surgical resection was performed in 90% of cases without significant geographical variation between Départements. A high variability in treatment choices leads to problems with case extraction from the PMSI database. For example, in aged men, prostate cancer can be treated by surgery, radiotherapy, or even hormone therapy alone (without hospitalization) and, in bladder cancer, the surgical treatment depends on the stage. If surgery is avoided, the absence of “surgery for cancer” in the PMSI database means that these cancer cases are missed.

In addition, our method rests fundamentally on the hypothesis that, within a given age group, the PMSI cases/incident cases ratio is constant between Départements, whereas several factors may affect that ratio. First, chaining of hospital stays is essential. Indeed, in a previous work [7], insufficient chaining made it necessary to consider the number of stays rather than the number of patients. Consequently, the between-Département variability in the mean number of stays per patient (essentially due to very different hospitalization policies and coding practices) prevented correct estimations. In 2002, the chaining rate was quite low (92%) but improved up to 2004 (nearly 96%) because of the implementation of the “Tarification A l’Activité” (a prospective payment system). This rate further improved nationwide up to 2007. This high quality will soon become the norm. The use of a single-patient identifier allows keeping a single record per patient whatever the number of hospital stays and ensures a better homogeneity of PMSI data.

Variability may also stem from various coding habits in different health institutions because of ignorance or misinterpretation of coding rules. A wide interinstitution variability of the PMSI cases/incident cases ratio may lead to a wide between-Département variability, especially between Départements with few health institutions.

Besides, the PMSI cases/incident cases ratio reflects the fact that, for a given incidence level, the prevalence may vary between Départements. Indeed, different cancer-specific survival rates between Départements affect hospital prevalence and thus the number of PMSI cases. In fact, survival in a given Département may vary with the presence or absence of systematic screening, the existence or not of a reference health institution [24], and the educational [25] and socioeconomic levels of the population [26]. If survival is high because of complete cures, hospital prevalence and PMSI cases will decrease, which will lead to underestimating incidence; but if survival is high and associated with more procedures, hospital prevalence and PMSI cases will increase, which will lead to overestimating Département-specific incidence. The impact of different survival rates between Départements on hospital prevalence is difficult to grasp and undoubtedly depends on the cancer site under study.

Another implicit hypothesis in our method is a constant PMSI cases/incident cases ratio from 2004 to 2007. This is plausible because data quality of both sources over that short period was deemed constant despite improvements in cancer therapies that would have changed the prevalence of cancer. Another effect of time is the change in the national standards of coding PMSI data. For example, coding palliative care as “cancer” has been replaced by a specific ICD-10 code (Z51.5). More recently, in 2009, the rules for the choice of the principal diagnosis have changed; the impact of that change should be evaluated. Nevertheless, that impact would be limited within the context of the four cancers under study here.

Improvements of the present method are possible. A better follow-up of the same patients over several years would exclude a number of prevalent cases, and the ratio would then vary less between Départements. Ideally, if all prevalent cases were excluded, the ratio would be interpreted as the proportion of hospitalized incident cases and would no longer be affected by different prevalence in different Départements. Another improvement would be to add data from the health insurance system (Affections Longue Durée (ALD30) database of the Caisses d’Assurance Maladie); a feasibility study of that possibility is underway.

5. Conclusion

Using an adequate method, it now seems possible to estimate Département-specific incidence of some cancers for a given year. A validation procedure should accompany these estimations. Nevertheless, this validation is only partial because Département-specific estimations still rest on the basic assumption of similar coding practices in all hospitals.


WASR: World age-standardized incidence rate
SIR: Standardized incidence ratio


This project was supported by a grant from the Institut National du Cancer, France. The authors thank the FRANCIM network for sharing its incidence database.