ABSTRACT

Statistical methods are vital to biomedical research. Our aim was to find out whether progress has been made in the last decade in the use of statistical methods in Chinese medical research. We reviewed 10 leading Chinese medical journals published in 1998 and in 2008. Regarding statistical methods, using multiple t-tests for multiple group comparison was the most common t-test error in both years, although it decreased significantly in 2008. In contingency tables, the failure to adjust the significance level for multiple comparison decreased significantly in 2008. In ANOVA, over a quarter of articles misused the method of multiple pair-wise comparison in both years, and no significant difference was seen between the two years. In the rank transformation nonparametric test, the error of using multiple pair-wise comparison for multiple group comparison became less common. Many mistakes were found in randomised controlled trials (56.3% in 1998; 67.9% in 2008), non-randomised clinical trials (57.3%; 58.6%), basic science studies (72.9%; 65.5%), case studies or case series studies (48.4%; 47.2%), and cross-sectional studies (57.1%; 44.2%). Progress has been made in the use of statistical methods in Chinese medical journals, but much is yet to be done.

1. INTRODUCTION

Statistics play a key role in biomedical research [1–6]; their correct use is thus essential to a high-quality study. The misuse or inaccurate use of statistical methods may point the research in the wrong direction and produce incorrect study results.

China produces a large number of biomedical articles. According to the database of the Institute for Scientific Information (ISI), there has been a significant increase in the quantity and quality of Chinese biomedical publications in the last two decades, especially in the last decade [7]. However, it is common to find inappropriate statistical methods in Chinese medical journals. He et al. reported in 2009 that many more statistical errors existed in Chinese medical journals than in international journals [8].

Our previous study compared the research design, statistical analyses, and presentation and interpretation of results of 10 leading Chinese medical journals published in 1998 and 2008 in Chinese [9]. The main results we obtained were the frequencies of different types of study design, defective proportions in design and statistical analyses, and the inappropriate presentation and interpretation of results. Further, we mentioned that the most frequently used statistical methods were still the simple tests, although more sophisticated statistical methods were already being applied in 2008. As for the study design, our focus was primarily on retrospective studies, with clinical trials receiving relatively little attention.

In this research, we again used the 10 leading Chinese medical journals published in 1998 and 2008 and, as an extension, extracted new data on the misuse and inaccuracy of each statistical method. We listed and compared the most common errors of each method that appeared in medical articles in 1998 and 2008. We also compared the proportions of the incorrect use of statistical methods in different study designs between the two years; in our previous study [9], we had compared the proportions of design defects in various study designs. All statistical procedures and methods in each article were reviewed, and trends in the misuse of statistical methods were reported. We summarised the progress that had been made during the past 10 years and discussed the current concerns about Chinese medical journals. We analysed the possible reasons for the main errors and suggested some improvements to the quality of Chinese medical journals.

2. METHODS

2.1. The 10 Chinese Medical Journals Used

The 10 leading Chinese medical journals that we examined were Chinese Journal of Internal Medicine, Chinese Journal of Surgery, Chinese Journal of Pediatrics, Chinese Journal of Obstetrics and Gynecology, Chinese Journal of Ophthalmology, Chinese Journal of Hematology, Chinese Journal of Stomatology, Chinese Journal of Cardiology, Chinese Journal of Oncology, and Chinese Journal of Tuberculosis and Respiratory Diseases. These journals were peer-reviewed, sponsored by the Chinese Medical Association, and indexed by Medline; however, they were only published in Chinese. All the original articles published (1,335 in 1998 and 1,578 in 2008) were reviewed [9].

2.2. Data Collection and Processing

We recorded the four types of statistical methods most commonly used in medical journals: the t-test, contingency tables, analysis of variance (ANOVA), and the rank transformation nonparametric test. Some rarely used methods, such as mixed/multilevel models, analysis of covariance, and survival analysis, were not analysed in this study. We listed the main errors that appeared in the four types of methods. Frequencies and percentages were used to describe the occurrences of misuse of statistical methods. The quantities and proportions of the incorrect use of statistical methods in the different study designs were estimated. The details of quality control have been described in our previous study [9]. SAS 9.1.3 was used, and a Chi-square test was conducted to assess the trends. A P value of less than 0.05 was considered statistically significant. The odds ratio (OR) and 95% confidence interval (CI) were also estimated.
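
As an illustration of the comparison described above, the following Python sketch computes a Chi-square statistic, P value, OR, and 95% CI for one error type, using the counts reported in Section 3.1 for multiple t-tests (153 of 492 articles in 1998; 129 of 570 in 2008). It is not the SAS code used in the study. The function names are hypothetical, and the Pearson statistic without continuity correction and the Wald (log-scale) interval are assumptions about the exact formulas applied.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def p_value_1df(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a Wald confidence interval computed on the log scale."""
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Counts taken from Section 3.1 (articles using multiple t-tests for >2 groups).
errors_1998, total_1998 = 153, 492
errors_2008, total_2008 = 129, 570

# Rows: year (1998, 2008); columns: error present, error absent.
a, b = errors_1998, total_1998 - errors_1998
c, d = errors_2008, total_2008 - errors_2008

chi2 = chi_square_2x2(a, b, c, d)
p = p_value_1df(chi2)
or_, lo, hi = odds_ratio_ci(a, b, c, d)

print(f"chi2 = {chi2:.2f}, P = {p:.3f}, OR = {or_:.2f}, 95% CI: {lo:.2f} to {hi:.2f}")
```

Under these assumptions the output should land close to the values reported for this error in Section 3.1; a difference in the last digit of the Chi-square statistic can arise from rounding or from the exact formula the software applies. An OR above 1 here means the error was more common in 1998 than in 2008.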

3. RESULTS

3.1. Errors in the Different Statistical Methods

As described in our previous study, 492 and 570 articles used the t-test in 1998 and 2008; 319 and 523 used contingency tables; 202 and 446 used ANOVA; 67 and 187 used the rank transformation nonparametric test [9]. The specific errors that occurred in the four types of statistical methods of both years are listed in Table 1.

The proportion of the misuse of the t-test decreased significantly in 2008. The prevalence of using multiple t-tests for multiple group comparisons (using multiple t-tests to compare the means of more than two groups) was particularly disconcerting: 31.1% (153/492) in 1998 and 22.6% (129/570) in 2008 (χ² = 9.71, P = 0.002, OR = 1.54, 95% CI: 1.17 to 2.03). Two other major errors were the use of the t-test under a nonparametric setting (from 18.1% (89/492) in 1998 to 10.5% (60/570) in 2008; χ² = 12.52, P < 0.001, OR = 1.88, 95% CI: 1.32 to 2.67) and the use of the t-test to conduct repeated-measure data analysis (from 14.8% (73/492) in 1998 to 10.5% (60/570) in 2008; χ² = 4.48, P = 0.034, OR = 1.48, 95% CI: 1.03 to 2.13).

A declining trend was also observed in the inaccurate use of contingency tables. The two most common mistakes were the lack of a significance level adjustment for multiple comparison (from 25.7% (82/319) in 1998 to 14.2% (74/523) in 2008; χ² = 17.53, P < 0.001, OR = 2.10, 95% CI: 1.48 to 2.98) and the lack of a continuity correction or Fisher exact test when needed (from 16.3% (52/319) in 1998 to 10.1% (53/523) in 2008; χ² = 6.90, P = 0.009, OR = 1.73, 95% CI: 1.15 to 2.61). The third most common mistake was using the Chi-square test for ranked data, where no significant decline was seen (χ² = 3.00, P = 0.083, OR = 1.59, 95% CI: 0.94 to 2.69).

Unfortunately, the incorrect use of ANOVA remained high in 2008. Over a quarter of the articles misused the multiple pair-wise comparison of post hoc ANOVA in both years (χ² = 1.30, P = 0.225, OR = 0.80, 95% CI: 0.55 to 1.17). However, two errors decreased in 2008: using ANOVA to analyse repeated-measures data and not using a multiple pair-wise comparison of ANOVA when needed; both reached a level of statistical significance (χ² = 6.65, P = 0.010, OR = 1.74, 95% CI: 1.14 to 2.67 and χ² = 6.89, P = 0.009, OR = 2.11, 95% CI: 1.20 to 3.72, resp.).

There were two main errors in the rank transformation nonparametric test. One was the use of multiple pair-wise comparison for multiple groups, which became significantly less common in 2008 (χ² = 4.43, P = 0.035, OR = 2.21, 95% CI: 1.04 to 4.47). The other, the use of the wrong type of rank sum test for the study type, did not show a significant difference (χ² = 2.07, P = 0.150, OR = 3.89, 95% CI: 0.85 to 17.88).

3.2. Misuse of Statistical Methods in Different Study Designs

Despite the significant growth in use of statistical methods, substantive errors still existed in different study designs. Table 2 shows the quantities and proportions of the statistical methods used and the errors that were found. Errors mainly occurred in clinical trials, basic science study, and retrospective study.

In the clinical trials, over half of the articles with statistical methods had mistakes in both years. No statistically significant change was seen in the clinical trials during the last 10 years for randomised controlled trials (χ² = 1.70, P = 0.192, OR = 0.61, 95% CI: 0.29 to 1.29) or nonrandomised clinical trials (χ² = 0.02, P = 0.878, OR = 0.95, 95% CI: 0.48 to 1.87). A large number of statistical errors existed in basic science studies, a design that was used frequently in both years. The proportions of errors were 72.9% (175/240) in 1998 and 65.5% (268/409) in 2008 (χ² = 3.81, P = 0.051, OR = 1.42, 95% CI: 1.00 to 2.01). The situation was equally worrisome in retrospective studies, namely case-control studies and case studies or case-series studies. Although a downward trend in mistakes was seen in case-control studies (χ² = 7.05, P = 0.008, OR = 1.59, 95% CI: 1.13 to 2.24), there was no significant improvement in case studies or case-series studies (χ² = 0.04, P = 0.837, OR = 1.05, 95% CI: 0.68 to 1.62). It was gratifying to see a significant drop in the proportion of errors in cohort studies (χ² = 19.01, P < 0.001, OR = 5.46, 95% CI: 2.48 to 12.05), but no improvement was observed in cross-sectional studies (χ² = 1.80, P = 0.180, OR = 1.68, 95% CI: 0.79 to 3.60).

4. DISCUSSION

4.1. Possible Reasons for the Occurrence of Errors

Among the errors, the biggest problem was the inappropriate choice of statistical methods. A possible reason is that not much attention was paid to the distributional characteristics of the variables and the nature of the data. Apparently, owing to a lack of basic statistical knowledge, researchers ignored the conditions under which a given method applies. When the quantitative data did not meet the prerequisites for parametric tests, they applied the tests anyway. Many researchers mistakenly believed that the Chi-square test was a universal tool for dealing with contingency tables, and they used it without taking the data characteristics into consideration. Some multifactorial experimental studies were split into a series of single-factor studies, which severed the intrinsic links or interactions among factors and led to one-sided or even wrong conclusions. Park et al. stated that the selection of the correct statistical method depends on the data structure and underlying statistical assumptions [10]. However, some errors were very common among articles, and they were wrongly cited or used by others, resulting in a vicious circle. As Altman said, "once incorrect procedures become common, it can be hard to stop them from spreading through the medical literature like a genetic mutation" [11].

4.2. Correct Methods Should Be Used in These Situations

Regarding the t-test, the most frequent error was using multiple t-tests for multiple group comparison, which may increase the probability of making a Type I error. There are several methods for multiple comparison, such as the Bonferroni method, Scheffé method, Tukey method, Newman-Keuls method, and Duncan method [12]. Around 4.95% and 6.95% of the articles that used ANOVA in 1998 and 2008, respectively, employed one-factorial ANOVA to analyse data from multifactorial designs. One-factorial ANOVA is used when there is only one experimental factor; when two or more experimental factors are involved, multifactorial ANOVA should be used [13, 14]. The t-test and standard ANOVA require independent data that have no correlation with each other. Repeated-measures data do not meet this requirement; instead, repeated-measures ANOVA or mixed-effects models should be used. Mixed-effects models are recommended, as they have greater flexibility to model time effects and can handle missing data more appropriately [15]. A common error encountered in contingency tables in both years was the omission of a continuity correction or Fisher's exact test when needed. It is considered incorrect to use the Chi-square test directly in contingency table analysis if the total sample size is not more than 20, or if more than 20% of the expected frequencies are less than five; Fisher's exact test should be applied in both cases [16]. Nonparametric tests are often used in place of parametric tests when the assumptions of the parametric test have been grossly violated (e.g., if the distributions are too severely skewed). Nonparametric tests are also recommended for small sample sizes or data sets with many ties. The error proportions of using the Chi-square test for ranked data were 9.09% and 5.93% in the two years. Instead of a Chi-square test, a rank transformation nonparametric test should be used on ranked data.
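
To illustrate why multiple t-tests inflate the Type I error, the Python sketch below computes the approximate family-wise error rate when every pair of groups is compared at the 0.05 level, together with the Bonferroni-adjusted per-test level. The function name is hypothetical. The formula 1 - (1 - alpha)^m assumes independent tests, which pair-wise comparisons sharing groups are not, so the figures are approximations.

```python
from math import comb


def familywise_error(k_groups, alpha=0.05):
    """Return (number of pair-wise tests, approximate family-wise error rate,
    Bonferroni-adjusted per-test alpha) for k_groups compared pair by pair."""
    m = comb(k_groups, 2)               # number of pair-wise comparisons
    fwer = 1 - (1 - alpha) ** m         # assumes the m tests are independent
    bonferroni_alpha = alpha / m        # per-test level that keeps the family rate <= alpha
    return m, fwer, bonferroni_alpha


for k in (3, 4, 5):
    m, fwer, adj = familywise_error(k)
    print(f"{k} groups: {m} tests, family-wise error ~ {fwer:.3f}, "
          f"Bonferroni alpha = {adj:.4f}")
```

With three groups, the three pair-wise tests already raise the chance of at least one false positive to roughly 14%, and with five groups to roughly 40%, which is why ANOVA followed by an appropriate multiple comparison procedure is preferred.
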
For study designs, the more complex the study design was, the more mistakes in statistical methods were likely to appear.
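
The sample-size rule for contingency tables stated in this section can be sketched as a simple check on the expected frequencies. The Python code below is only an illustration of that rule as worded here (total sample size not more than 20, or more than 20% of expected frequencies below five); the function names are hypothetical, and the second and third example tables are invented for demonstration.

```python
def expected_counts(table):
    """Expected cell frequencies under independence for a table given as rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def suggest_test(table):
    """Apply the rule from the text: prefer Fisher's exact test when the total
    sample size is at most 20 or when more than 20% of expected counts are < 5."""
    n = sum(sum(row) for row in table)
    expected = [e for row in expected_counts(table) for e in row]
    share_small = sum(1 for e in expected if e < 5) / len(expected)
    if n <= 20 or share_small > 0.20:
        return "Fisher's exact test"
    return "Chi-square test"


# Counts from Section 3.1 (82/319 in 1998 vs 74/523 in 2008): large sample.
print(suggest_test([[82, 237], [74, 449]]))
# Invented small table: total sample size of 18.
print(suggest_test([[3, 7], [6, 2]]))
# Invented table: n = 60, but half of the expected counts are below five.
print(suggest_test([[1, 29], [5, 25]]))
```

The check only selects between the two tests named in the text; whether a continuity correction is applied in borderline cases is a separate decision that this sketch does not cover.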

4.3. Progress and Worries

In general, progress has been made in the statistical methods of Chinese medical journals in the last decade. The percentage of articles using statistical methods has increased, and the proportion of errors has significantly decreased in most of the statistical methods and study designs. This conclusion was consistent with what Wang et al. reported in 1998 that the proportion of papers in Chinese medical journals using appropriate statistical methods had increased in 1995 compared with 1985 [17]. From this point, we can see that Chinese medical researchers have made great efforts to employ statistical methods in their studies. However, we cannot be overoptimistic because the situation is very far from satisfactory. Although statistical errors also exist in the medical journals of western countries, the proportion is smaller. McGuigan reviewed all papers published in the British Journal of Psychiatry in 1993 and found that 40% of the papers contained statistical errors [18]. Welch and Gabbe reviewed 145 clinical articles published in American Journal of Obstetrics and Gynecology in 1994 and pointed out 46 articles (31.7%) that were deemed to have applied statistics inappropriately [19]. Kurichi and Sonnad reported that only 27% of the studies in five selected surgical journals of America in 2003 included incorrect selection or reporting of statistical methods [20]. Another study, conducted by Neville et al., assessed the frequency of statistical errors in dermatological literature. The study revealed that only 14% of the articles with statistical analysis contained errors in the methods; 26.5%, in the presentation of the results; 2.6%, in both [21].

In China, the situation is quite depressing, as the error rate of statistical methods remains high. In ANOVA, the total error rate hit approximately 60% in 1998 and 2008. Many mistakes were made even in the most basic aspects. For instance, 31.10% of articles used the t-test for multiple group comparison in 1998 and 22.63% in 2008. In clinical trials, over half of the articles had statistical errors. The proportion of errors was extremely high in basic science study and retrospective study, even though these designs were frequently used.

In addition, many sophisticated statistical methods, such as analysis of covariance, repeated-measures analysis, logistic regression, and survival analysis, were seldom used in Chinese medical journals, an observation that was also made by Wang and Zhang in their study [17]. This suggests that a large amount of data is not being efficiently analysed, so that much of the information is wasted. Considering the high incidence of errors in the simple statistical methods, it is not hard to imagine how bad the situation is with regard to sophisticated statistical methods. Moreover, since we studied only the 10 leading medical journals, it is likely that our results were above the actual average of Chinese medical journals. It must be noted, though, that some Chinese research papers published in international medical journals, whose statistical methodology might be of better quality, were not included in this study. Thus, as our next step, we intend to conduct a survey on Chinese clinical studies that have been published in international journals.

5. RECOMMENDATIONS

The 10 medical journals we selected are representative of excellent Chinese medical journals. Nevertheless, there is still a wide gap between them and the international journals with respect to statistics. Some measures are needed to decrease the errors in the statistical methods and improve the quality of articles.

Firstly, clinicians and medical researchers should correct their attitude about writing. Their purpose of publication should be to make their results known to their colleagues and raise the level of medical science of mankind; it should not be personal aggrandisement. Only correct medical outcomes can benefit people. Secondly, statistical education should be enhanced among clinicians and researchers; they should have a basic understanding of statistics and study design. An integrated and detailed protocol should be made beforehand. When performing and analysing RCTs, the CONSORT statement, which is now accepted internationally, is recommended as a guideline. Thirdly, statisticians should assume an important role in the research; in other words, a research group should include a statistician as a consultant. Finally, statistical reviewers should be included in the editorial boards of the journals. Some journals merely intend to make profits through page charges, publishing random articles without taking quality into account. Measures must be implemented to prevent such practices.

ACKNOWLEDGMENTS

This study was supported by the "Strategic Priority Research Program" of the Chinese Academy of Sciences, Grant no. XDA01040100, and the Ministry of Science and Technology of China (2008ZX09312-007, 2009ZX09312-025, 2008ZX10002-018). The authors thank the following members for data collection and group discussion: Xiaofei Ye, Xiaojing Guo, Xinji Zhang, and Jiajie Zang. Shunquan Wu and Zhichao Jin contributed equally to this study.