Abstract

Background. Survival analysis has attracted the attention of scientists from various domains such as engineering, health, and the social sciences. It is widely used in clinical trials to compare treatments on the basis of their survival probabilities. Kaplan–Meier curves, plotted from the Kaplan–Meier estimates of the survival probabilities, are used to give an overall picture of such situations. Methods. The weighted log-rank test has been studied by suggesting different weight functions, each of which gives the test particular strength in particular situations. In this work, we propose a new weight function comprising all the numbers at risk, i.e., the overall number at risk and the separate numbers at risk in the groups under study, to detect late differences between survival curves. Results. According to the simulation studies, the new test is a good alternative after the FH (0, 1) test in detecting late differences, and it outperformed all tests in the case of small samples and heavy censoring rates. The new test kept the same strength when applied to real data, where it was among the most powerful tests or even outperformed all the other tests under consideration. Conclusion. As the new test remains stronger in the case of small samples and heavy censoring rates, it may be a better choice whenever the goal is to detect late differences between survival curves.

1. Introduction

Survival analysis has many real-world applications: in engineering, such as testing the lifetime of light bulbs; in medicine, such as testing the efficacy of different treatments; and even in the social sciences. In medical research studies, the comparison of two treatments is of crucial importance because it helps decide which treatment works better than another. This is where the comparison of survival curves plays its role.

Survival curves are compared when two or more samples are subjected to different treatments or drugs. When comparing drugs, they are tested on parallel groups to decide which one is more effective. Effectiveness may be understood as the time a drug takes to produce a positive effect, if any, and in what proportion of cases. To compare survival curves, we record the survival probabilities at each time of interest for the groups or samples under consideration, draw the Kaplan–Meier curves, and compare them using different techniques. Different scenarios are explored, and some tests are more powerful in specific scenarios. Such scenarios are proportional hazards, early differences, and late differences. Some authors also include middle differences, even though these attract less attention, probably because they rarely occur. The test explored in this research is appropriate for investigating late differences between curves.

2. Materials and Methods

2.1. Weighted Log-Rank Test

The weighted log-rank test is sometimes used to test the equality of survival distributions. Taking the case of two groups or two treatments, the hypotheses being tested are of the following form:

$$H_0: S_1(t) = S_2(t) \quad \text{for all } t,$$

against

$$H_1: S_1(t) \neq S_2(t) \quad \text{for some } t,$$

where $S_i(t)$ is the survival probability in group $i$ at time $t$.

In the case of nonproportional hazard rates, the comparison of survival curves is preferably done using different weighted log-rank tests. The weight function plays a crucial role, and its misspecification leads to inaccurate results and a loss of power of the test.

The weighted log-rank statistic is written in stochastic integral form as

$$U = \int_0^{\tau} w(t)\,\frac{Y_1(t)\,Y_2(t)}{Y(t)}\left\{\frac{dN_1(t)}{Y_1(t)} - \frac{dN_2(t)}{Y_2(t)}\right\},$$

where $\tau$ is the total time of the study, $w(t)$ is the weight function at time $t$, $Y_i(t)$ is the number of items/individuals at risk at time $t$ in the $i$th group, $Y(t) = Y_1(t) + Y_2(t)$ is the overall number of items/individuals at risk at time $t$, and $N_i(t)$ is the number of items/individuals which underwent the event of interest by time $t$ in the $i$th group [1, 2].

The variance of this weighted log-rank statistic is estimated by the quantity

$$\hat{V} = \int_0^{\tau} w(t)^2\,\frac{Y_1(t)\,Y_2(t)}{Y(t)^2}\left(\frac{Y(t) - \Delta N(t)}{Y(t) - 1}\right)dN(t),$$

where $N(t) = N_1(t) + N_2(t)$.

Computationally, the weighted log-rank statistic is written as follows:

$$U = \sum_{j=1}^{k} w(t_j)\left(d_{1j} - Y_{1j}\,\frac{d_j}{Y_j}\right),$$

where $w(t_j)$ is the weight at time $t_j$, $Y_j$ is the overall number of items/individuals at risk at time $t_j$, $Y_{1j}$ is the number of items/individuals at risk at time $t_j$ in the first group, $d_{1j}$ is the number of events of interest at time $t_j$ in the first group, and $d_j$ is the overall number of events of interest at time $t_j$.

The statistic $U$ is such that its expected value is $E[U] = 0$ and its variance is

$$V = \sum_{j=1}^{k} w(t_j)^2\,\frac{Y_{1j}}{Y_j}\left(1 - \frac{Y_{1j}}{Y_j}\right)\left(\frac{Y_j - d_j}{Y_j - 1}\right) d_j,$$

and hence, the statistic to be computed becomes

$$\chi^2 = \frac{U^2}{V},$$

where $Y_j$ is the overall number at risk in both groups at time $t_j$ and $Y_{1j}$ is the number at risk in the first group at time $t_j$. We recall that the statistic mentioned above is asymptotically chi-square distributed with one degree of freedom and can be reduced to a normally distributed statistic as follows:

$$Z = \frac{U}{\sqrt{V}}.$$
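
For illustration, the computational form above can be coded directly from the grouped counts at the distinct event times. The following is a minimal R sketch (not the authors' code); the input names d1, d, y1, y, and w are hypothetical.

```r
## Minimal sketch of the computational weighted log-rank statistic.
## Inputs (hypothetical names): at each distinct event time t_j,
##   d1, d  -- events in group 1 and overall,
##   y1, y  -- numbers at risk in group 1 and overall,
##   w      -- chosen weights.
weighted_logrank <- function(d1, d, y1, y, w) {
  U <- sum(w * (d1 - y1 * d / y))              # weighted observed minus expected
  V <- sum(w^2 * (y1 / y) * (1 - y1 / y) *     # variance contribution at each event time
             ((y - d) / (y - 1)) * d, na.rm = TRUE)
  z <- U / sqrt(V)                             # asymptotically standard normal
  list(U = U, V = V, chisq = z^2, z = z, p.two.sided = 2 * pnorm(-abs(z)))
}

## Toy usage with three event times:
weighted_logrank(d1 = c(1, 0, 2), d = c(2, 1, 3),
                 y1 = c(10, 8, 5), y = c(20, 15, 9), w = c(1, 1, 1))
```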

The weighted log-rank test statistic contains all three quantities $Y(t)$, $Y_1(t)$, and $Y_2(t)$, while the weights considered by different researchers were based on $Y(t)$, transformed in different ways, or on the overall survival probability [3]. Even the survival probabilities considered were those of the overall (pooled) sample.

Some of the well-known weight functions are displayed in Table 1.

Various modifications and improvements have been made to obtain more powerful weight functions. For example, Garès et al. [4] used the family of tests proposed by Fleming and Harrington [5] to investigate late effects in controlled trials. There exists another test statistic, built from a given number of FH tests, called the max-combo test statistic [6, 7]. This test is calculated as the maximum of a selected set of standardized FH statistics. This technique was introduced because nearly every test statistic has high power only in a specific situation, which one would need to know in advance. A sketch of this construction is given below.
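
As an illustration, a minimal R sketch of the Fleming–Harrington weights and of a max-combo over an assumed default set of $(\rho, \gamma)$ pairs follows; it reuses the weighted_logrank() function sketched above, and the reference distribution of the maximum (which must account for the correlation between the component statistics) is not computed here.

```r
## Fleming-Harrington weight S(t-)^rho * (1 - S(t-))^gamma, where S_prev is the
## pooled Kaplan-Meier estimate just before each event time (hypothetical name).
fh_weight <- function(S_prev, rho, gamma) S_prev^rho * (1 - S_prev)^gamma

## Max-combo over an assumed set of (rho, gamma) pairs: the maximum of the
## absolute standardized statistics.
max_combo <- function(d1, d, y1, y, S_prev,
                      pairs = list(c(0, 0), c(1, 0), c(0, 1), c(1, 1))) {
  z <- sapply(pairs, function(p) {
    w <- fh_weight(S_prev, p[1], p[2])
    weighted_logrank(d1, d, y1, y, w)$z   # reuses the sketch above
  })
  max(abs(z))
}
```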

However, it is not easy to know whether, in the situation under study, there are early or late effects. FH (0, 1) is more powerful in the case of late effects or late separation of the survival curves, while FH (1, 0) is more powerful in the case of early effects or early separation of the survival curves. The lack of prior knowledge about the (location of the) effects is the reason for using a combination of two or more tests in order to capture every feature [7].

Rückbeil et al. [6] dealt with the max-combo test statistic built from three standardized FH tests under five different randomization procedures. They compared the separate FH tests with the max-combo test and found that the max-combo test was the second in power in each case, with its highest power of 83% observed when assessing late treatment effects.

Lee [8] dealt with the standardization of weighted log-rank test statistics and max-combo test statistics (Lin et al. [9]); the standardized statistic is the statistic divided by the square root of its variance estimate. Three cases were considered for combining multiple standardized weighted log-rank test statistics. Considering two weighted statistics and their corresponding Z statistics $Z_1$ and $Z_2$, as studied by [8], the three cases are as follows (see the sketch below):
(i) The average of the absolute values, that is, $(|Z_1| + |Z_2|)/2$.
(ii) The absolute value of the average, that is, $|(Z_1 + Z_2)/2|$.
(iii) The maximum of the absolute values, that is, $\max(|Z_1|, |Z_2|)$.
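
A small R sketch of these three combinations for two standardized statistics (the names z1 and z2 are illustrative) follows.

```r
## The three combinations listed above for two standardized statistics z1 and z2.
combine_z <- function(z1, z2) {
  c(avg_of_abs = (abs(z1) + abs(z2)) / 2,   # (i) average of the absolute values
    abs_of_avg = abs((z1 + z2) / 2),        # (ii) absolute value of the average
    max_of_abs = max(abs(z1), abs(z2)))     # (iii) maximum of the absolute values
}
```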

Lee [10] evaluated the maximum and the average of four such standardized statistics. Karrison [11] considered the maximum of three Z statistics obtained from three FH tests. This combination covers a good range of possibilities, including early differences, late differences, and proportional hazards.

Abou-Shaara [12] studied the similarities between the Kaplan–Meier method and ANOVA and found that the two methods lead to the same conclusion.

There can be a need to estimate a confidence interval for the estimated survival probability [13], and it is found as follows:

$$\hat{S}(t) \pm z_{1-\alpha/2}\,\hat{\sigma}\big[\hat{S}(t)\big],$$

where $\hat{\sigma}^2\big[\hat{S}(t)\big]$ is computed according to Greenwood’s formula as follows:

$$\hat{\sigma}^2\big[\hat{S}(t)\big] = \hat{S}(t)^2 \sum_{t_j \le t} \frac{d_j}{Y_j\,(Y_j - d_j)}.$$
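
A minimal R sketch of this pointwise interval, assuming the plain (untransformed) normal approximation above, is given below; the input names are hypothetical.

```r
## Pointwise confidence interval for the Kaplan-Meier estimate via Greenwood's
## variance. S_hat is the estimate at time t; d and y are the events and numbers
## at risk at the distinct event times up to and including t.
km_ci <- function(S_hat, d, y, conf.level = 0.95) {
  var_S <- S_hat^2 * sum(d / (y * (y - d)))          # Greenwood's formula
  z <- qnorm(1 - (1 - conf.level) / 2)
  c(lower = S_hat - z * sqrt(var_S), upper = S_hat + z * sqrt(var_S))
}
```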

Klein et al. [14] proposed a so-called naive test of the null hypothesis that the survival functions are equal at some fixed time points. Such a test may be based on either the cumulative hazards or the survival probabilities.

Qian and Zhou [15] proposed a family of hyperbolic-cosine-shaped (CH) hazard rate functions, and the deduced CH class of weight functions generated good test statistics for the detection of late differences.

2.2. New Weight Function

The existing weight functions are functions of $Y(t)$ and, hence, vary with the total remaining number of individuals at risk in general. The use of $Y(t)$, transformed in different ways, shows that only the size of the total number of individuals at risk is taken into account. However, the separate numbers $Y_1(t)$ and $Y_2(t)$ of individuals at risk in each of the groups could be involved and may help to capture more features. Involving $Y_1(t)$ and $Y_2(t)$ separately in the weight helps to detect differences in the occurrence of the event of interest between the two groups at each time point, depending on the relation between the two numbers. There is, therefore, a need for a new weight function comprising $Y(t)$, $Y_1(t)$, and $Y_2(t)$ simultaneously, which changes as a function of the three variables and hence takes into account the variation between $Y_1(t)$ and $Y_2(t)$. This new weight is expected to be more adaptive since it captures, to some extent, the difference in variation between $Y_1(t)$ and $Y_2(t)$ by itself, being relatively small (large) for small (large) differences between the two quantities. In other words, if the occurrences are roughly equal in both groups, the weight is relatively lighter than when the occurrences are higher in one group than in the other. While $Y(t)$ reflects only the overall change (and hence overall occurrences), the separate changes in the numbers of individuals at risk in the respective groups are needed in the search for more accuracy and precision of the test.

The new weight function proposed in this study is of the following form:

According to its form, this weight function is monotone increasing. For different couples $(Y_1(t), Y_2(t))$ whose sum is $Y(t)$, the new weight is relatively higher when the difference between $Y_1(t)$ and $Y_2(t)$ is large than when the two numbers are nearly equal.

With this weight, the stochastic form of the first statistic reduces to the following expression, with its corresponding variance given below:

By direct observation, it can be seen that this statistic depends on the variation in the numbers of events in the respective groups, which may explain its expected sensitivity.

Substituting the new weight function into the general weighted log-rank statistic, we obtain the new statistic, which is as follows:

or simply
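
Since the exact formula is given above, we do not restate it here; the following generic R sketch only shows how a weight depending jointly on $Y(t)$, $Y_1(t)$, and $Y_2(t)$ plugs into the weighted log-rank machinery, with weight_fun standing for the new weight function.

```r
## Generic driver for a weight that depends on the overall and group-specific
## numbers at risk; weight_fun(y, y1, y2) stands for the new weight defined above.
weighted_logrank_custom <- function(d1, d, y1, y, weight_fun) {
  y2 <- y - y1                         # number at risk in the second group
  w  <- weight_fun(y, y1, y2)          # evaluate the new weight at each event time
  weighted_logrank(d1, d, y1, y, w)    # reuse the statistic sketched in Section 2.1
}
```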

2.3. Power and Relative Efficiency of a Test

The power of a test statistic is by definition expressed as $1 - \beta$, where $\beta$ is the probability of a type II error. With the weighted log-rank test statistic, the quantity $U$ found in the numerator and the corresponding variance estimate $V$ in the denominator are the quantities that help to obtain the power [16]. The power of the test statistic is then computed as follows:

Since the $p$ value is also one of the methods of testing the hypothesis, it is good to recall how it is found from the two quantities. With $U$ and $V$, the one-sided $p$ value is calculated as follows:

$$p = 1 - \Phi\!\left(\frac{U}{\sqrt{V}}\right),$$

where $\Phi$ is the standard normal cumulative distribution function [17].

Given two weighted log-rank statistics, the ARE of the first relative to the second, as proposed by Jiménez et al. [18], is given by the expression below, where $\Phi^{-1}$ is the quantile function of the standard normal distribution:

Computationally, the power of the Z statistic obtained from the log-rank test is found as the proportion of simulations in which the null hypothesis is rejected, that is,

$$\widehat{1 - \beta} = \frac{1}{M} \sum_{m=1}^{M} \mathbf{1}\{H_0 \text{ rejected in simulation } m\},$$

where $M$ is the number of simulations performed (for example, 10,000, 5,000, 1,000, …); in our computations, we used $M = 5000$.
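
A minimal R sketch of this Monte Carlo power estimate follows, assuming a two-sided rejection rule at level $\alpha$; simulate_z is a hypothetical function returning the standardized statistic $Z$ for one simulated trial (for example, generated as in Section 3.1).

```r
## Monte Carlo power: the proportion of M simulated trials whose standardized
## statistic exceeds the two-sided critical value.
empirical_power <- function(simulate_z, M = 5000, alpha = 0.05) {
  z <- replicate(M, simulate_z())
  mean(abs(z) > qnorm(1 - alpha / 2))   # proportion of simulations rejecting H0
}
```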

3. Data Analysis

3.1. Simulation Study Scenario

The ideal illustration of late separation is depicted in Figure 1. To carry out the simulation study, we used the simsurv R package, which helps to simulate survival times from standard parametric distributions. In our case, we used the Weibull distribution to simulate the survival times. For one group, we generated the survival times using Weib(1.2, 3.6), while for the second group, 60% of the survival times were generated from Weib(2.9, 5.4) and the remaining 40% were generated from Weib(1.5, 3.6).

For each case, we performed 5,000 simulations, and the analysis was done in R. We considered equal sample sizes in all our simulations. The notation (n1, n2)(c) has been used, where n1 = n2 is the number of individuals in each group and c is the overall censoring rate. The censoring rates taken into account are 20%, 40%, and 60%, and c = 0 means that there has been no censoring. There are therefore four simulation cases for each sample size. The sample sizes used per group are 20, 50, 80, and 100. A sketch of the data-generating step is given below.
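
The following minimal R sketch reproduces the mixture scenario described above using base R's rweibull(), under the assumption that Weib(a, b) denotes shape a and scale b (the exact simsurv parameterization used in the study is not restated here); the censoring mechanism is omitted for brevity.

```r
## One simulated trial of the late-separation scenario: group 1 from Weib(1.2, 3.6),
## group 2 from a 60/40 mixture of Weib(2.9, 5.4) and Weib(1.5, 3.6).
simulate_arms <- function(n) {
  t1  <- rweibull(n, shape = 1.2, scale = 3.6)          # group 1
  mix <- runif(n) < 0.60                                # mixture indicator for group 2
  t2  <- ifelse(mix,
                rweibull(n, shape = 2.9, scale = 5.4),
                rweibull(n, shape = 1.5, scale = 3.6))
  data.frame(time = c(t1, t2), status = 1,              # status = 1: no censoring here
             group = rep(c(1, 2), each = n))
}

## Example: one trial with 20 individuals per group
head(simulate_arms(20))
```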

3.2. Simulation Results

To make the results more visible, we look at Figure 2, which shows graphically the variations in power reported in Table 2. To read the plots, NoCens100 stands for the case of no censoring with a sample size of 100 individuals per group; the same applies to 80, 50, and 20. Cens10020 stands for the case of 100 individuals per group with an overall censoring rate of 20%, and the same analogy applies to the others.

The new test may be recommended as an alternative test when aiming at the detection of late differences between treatments. It imposes itself as a good choice when the sample size becomes smaller; in other words, the new test outperforms the existing ones for small sample sizes. To see this more clearly, we used the relative efficiencies of all tests (in power) compared with the standard log-rank test. We mainly look at FH (1, 1) and FH (0, 1), and the new test appears to be relatively more efficient. Regarding the efficiency of the tests, we evaluate them relative to the standard log-rank test. The latter is known to perform best in the case of proportional hazards but still keeps some level of power in other scenarios; even in our case of late differences detection, it was the third choice, being outperformed by our newly proposed test. Table 3 shows the heatmap of the relative efficiencies of the other tests, at all levels of censoring under consideration, with respect to the LR test.
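
If the relative efficiencies in the heatmap are read as the ratio of each test's simulated power to that of the standard log-rank test, expressed as a percentage (an assumption on our part), they can be computed from a matrix of powers as in the following R sketch; power_mat is a hypothetical matrix with one column per test, one of which is named "LR".

```r
## Relative efficiency (in %) of each test with respect to the LR column.
relative_efficiency <- function(power_mat) {
  round(100 * power_mat / power_mat[, "LR"])
}
```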

As seen in Figure 3, the graph on the left side comes from a random simulation with a sample size of n = 100 per group, while the one on the right is for n = 20; the censoring rate is 20% in both cases. As can be seen, the FH (0, 1) weight, in dashed red, increases gradually, and this justifies its high power for late differences. The separation of curves usually happens gradually, and hence, as the difference becomes larger, the FH (0, 1) weight becomes larger too.

For the new weight, in solid blue, there is only an abrupt increase over a very small number of time points at the end, while it remains relatively very small from the beginning of the study. This behavior helps to explain its efficiency in the case of small sample sizes: in such cases, the late separation does not last long, so the new weight does not miss many of the event times of the separation. Apart from this, the new weight could be powerful in the case of an abrupt separation over the very last few event times, which may not happen often in practice. However, in those few favorable cases, the new weight can reach values above 1, as seen in the graph on the right, where it even reaches 2 at one of the last points. The relative weakness of the new weight for large sample sizes lies in its failure to capture some event times at the beginning of the separation, which normally starts around the middle of the study, while remaining sensitive to a very limited number of the last event times, as both graphs show. In contrast, FH (0, 1) captures the whole separation gradually from its onset, as shown by its gradual increase. We recall that the new weight drops to 0 when the number at risk in one of the groups becomes 0, because there is no comparison at such points and onward. To make this clear, assume the separation happened at time point 30 (graph on the left). We can see how large the difference between the two weights is from then on, and hence the loss of power for the new weight. For the graph on the right, if the separation started at time point 25, for example, we notice that the difference between the two weights is not as large as in the left-hand case. But again, we may highlight that the new weight is very strict on the very last few event times, with exceptionally higher weights. The smaller loss of power for the new weight in the case of censoring lies in the fact that censoring reduces the number of event times; because the new weight needs only the very last few event times, it does not lose as much power as FH (0, 1), which would have benefited from many event times since the beginning of the separation. This is why small sample sizes and heavy censoring are the favorable cases for the new weight, which needs fewer of the last event times than FH (0, 1). This is not strange, because every weight function has circumstances in which it excels in power and others in which it fails. Our newly suggested weight is thus powerful in the case of heavy censoring and/or small sample sizes.

3.3. Discussion

As shown by the heatmap, we have three tests that are relatively powerful when compared with the standard log-rank test: FH (1, 1), FH (0, 1), and the new test. FH (1, 1) is more powerful than LR only when the censoring rate is higher than 50%, since its relative efficiency exceeded 150% only in the case where the censoring was 60%. For the cases of no censoring and of 20% censoring, this test was not more efficient than the standard LR test, irrespective of the sample size under consideration.

The FH (0, 1) test, which is usually known to be the most powerful for late differences, still keeps its power, but it is outperformed by the newly proposed test for small sample sizes. We can take two extreme points for the two tests. For n = 100 with no censoring, the RE of FH (0, 1) was 175%, while it was 150% for the new test. This implies that the difference in relative efficiency is 25 percentage points (or we can say that FH (0, 1) is 25% relatively more powerful than the new test when both are compared to the LR for n = 100).

For the other extreme point, with a censoring rate of 60%, the RE of FH (0, 1) is 356%, while it is 428% for the new test; this implies that the new test is 72 percentage points relatively more powerful than FH (0, 1) when both tests are compared to the LR test.

So we can see that, where it is relatively powerful, the new test makes a larger difference in relative efficiency than FH (0, 1) does under its own favorable conditions. Noting the importance of sample size, the new test may be a good recommendation because of its behavior in the case of small samples and heavy censoring.

To get a more general recommendation between the two tests, we can take an unweighted sum of the differences in relative efficiency over all cases under study. That is, we take the relative efficiency of FH (0, 1) minus that of the new test, RE(FH (0, 1)) − RE(new test), in each case and sum up to see which one is generally relatively more efficient. Operating on the data in the heatmap, we obtain −220 in total, which shows that the new test is relatively more efficient than FH (0, 1) in general. This is directly linked to the fact that, where the new test is relatively more efficient, it makes bigger differences.

3.4. Application to Real Data

To check the reliability of the new test, we used two real datasets to be sure of the comparison. Those datasets are as follows:
(i) the Head-and-Neck-Cancer Study by the Northern California Oncology Group (NCOG);
(ii) the time-to-infection data of kidney dialysis patients.

The data from the Head-and-Neck-Cancer Study conducted by the Northern California Oncology Group (NCOG) are found in Efron [19] and have been reused by many other authors, including, recently, Qian and Zhou [15]. Arm A represents the patients who underwent radiation therapy, and those who underwent radiation plus chemotherapy were put in Arm B.

The second dataset, the time to infection of kidney dialysis patients, is a built-in dataset available in R in the KMsurv package. The groups were formed according to the method used for placing catheters in the kidney dialysis patients: surgically placed catheters form group 1, and percutaneously placed catheters form group 2. The plot of the Kaplan–Meier curves for both datasets is shown in Figure 4.
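
For the kidney dialysis data, the curves can be reproduced with the short R sketch below; the column names (time, delta, type) follow the KMsurv documentation as we recall it and should be verified.

```r
## Kaplan-Meier curves for the KMsurv kidney dialysis data, by catheter placement.
library(survival)
data(kidney, package = "KMsurv")
fit <- survfit(Surv(time, delta) ~ type, data = kidney)
plot(fit, lty = 1:2, xlab = "Time", ylab = "Survival probability")
legend("bottomleft", lty = 1:2,
       legend = c("Group 1: surgically placed", "Group 2: percutaneously placed"))
```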

From Figure 4, we notice that for the NCOG data, the curves are close to each other at the beginning but separate later, where Arm B appears to have higher survival probabilities than Arm A. The two-sided p values for the nine tests have been computed and are given in Table 4.

As seen from the p values in Table 4, the newly proposed test showed itself to be stronger than any of the others, as it has the smallest p value of 0.0129, followed by the Fleming–Harrington FH (0, 1) test with a p value of 0.0223 and lastly by the standard log-rank test with a p value of 0.047. This is in accordance with the simulation results, even though here the new test outperforms the existing strongest test for late differences, FH (0, 1).

However, this is not strange, because the difference in the powers observed in the simulations was not that large, so one may not hesitate to recommend this new test as a good choice. The other tests obtained p values greater than 0.05 because they are usually known to be weak in the detection of late differences, and this is not surprising given the shape of the two curves; their failure or weakness in detecting such a difference comes from their nature. However, since the difference seems significant from an immediate look at the graph, if one-sided p values were considered, the majority of these tests could have p values less than 0.05, and hence the difference might be detected. In such a case, GW and FH (1, 0) might be the only ones to fail to detect the difference. The general observation that remains intact is that the new test performed better than any other test in this case.

As can be immediately observed from the KM curves for the kidney data, the two survival curves cross each other at the early stages, where they are even close to each other. After crossing, they separate quickly, and this leads us to the justification of the p value obtained for FH (1, 1) in Table 5. In addition to the two tests which were expected to detect such differences, we got another one, FH (1, 1), which is stronger in the detection of middle differences. In other words, because it gives heavier weights to middle events, decreasing as they move farther from the median time, it detected the differences in this case: by the middle of the study period, the curves had already separated, as can be immediately seen on the graph. It is to be highlighted that this test was surprising, since it nearly outperformed both expected tests, with a p value of 0.005. However, FH (0, 1) remained the first among the three tests with a p value of 0.0046, and the new test was third with a p value of 0.021. Contrary to the NCOG data, even if one-sided p values had been taken, no change would have been observed in the set of tests with significant p values.

4. Conclusion

The newly proposed test is a good alternative for the detection of late differences between survival curves. It shares with FH (0, 1) the positive behavior of being relatively more efficient and powerful than the LR, and even though the reduction of power as the censoring rate increases is common to all tests, this reduction is relatively small for the new test compared with the others (including the LR test and FH (0, 1)). The new test may, therefore, be the first choice in cases of small sample sizes and heavy censoring rates. The same strength has been observed when dealing with real datasets, where the new test remains sensitive to late differences in survival. Given that small sample sizes and censoring are major threats in survival analysis studies, and referring to the power and higher relative efficiency of the new test in such cases, one may consider it a better choice for the detection of late differences between survival curves.

Data Availability

The data used to support the findings of this study are publicly and freely available. One dataset is accessed through the R software, and the other is available in the cited research.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors thank Dr. Bayowa Teniola Babalola for his assistance in conceptual understanding and Marcin Kosiński for his guidance on the programming part.