Abstract

The productivity of researchers and the impact of the work they do are a preoccupation of universities, research funding agencies, and sometimes even researchers themselves. The h-index (h) is the most popular of the metrics used to measure these activities. This research presents a practical approach to modeling the h-index based on the total number of citations (NC) and the duration since the publication of the first article (D1). To determine the effect of each factor (NC and D1) on h, we applied a set of simple nonlinear regression models. The results indicated that both NC and D1 had a significant effect on h (p < 0.001). The coefficients of determination for these equations to estimate the h-index were 93.4% and 39.8%, respectively, which verified that the model based on NC had a better fit. Then, to capture the simultaneous effects of NC and D1 on h, multiple nonlinear regression was applied. The results indicated that NC and D1 had a significant effect on h (p < 0.001). Also, the coefficient of determination for this equation to estimate h was 93.6%. Finally, to model and estimate the h-index as a function of NC and D1, multiple nonlinear quartile regression was used. The goodness of fit of the model was also assessed.

1. Introduction

The productivity of researchers and the impact of the work they do are a preoccupation of universities, research funding agencies, and sometimes even researchers themselves. Several metrics have been used to measure these, including journal impact factors, citation counts, and publication rates. At present, however, the h-index is the most popular of these metrics [1–4]. Hirsch’s definition of the index is that h = m if m of a researcher’s p papers have at least m citations each and each of the other papers has no more than m citations. As a guide, Hirsch [1] suggested that a “successful” scientist would have h = 20 after 20 years of work, whereas outstanding and “truly unique” individuals would have h = 40 and h = 60, respectively, after 20 years of work. Subsequent work has shown that this is too great a generalisation, if only because the h-index is highly discipline-specific and depends on circumstance, on the comprehensiveness of the literature databases used to calculate the index, and on many other factors [5, 6]. For example, very eminent mathematicians often have h < 10, and some Nobel laureates also have very small h-indices [7]. The inevitable inference is that an individual’s h-index should be considered in the context of these factors and of the distribution of h for a given number of papers and citations appropriate to the individual researcher. Some researchers have introduced alternative versions of h [8]. Generally, all of these indices consider the number of citations received by articles. Recently, researchers have studied and developed theoretical models to estimate these indices based on other indicators, for example, based on NC [1], the total number of publications, T [9], and the total number of publications with a minimum of one citation, T1 [10]; based on NC and T [4, 11–15]; and based on NC, T1, and the total number of citations for the 1 most cited papers, C1 [16].
Librarians are particularly interested in using good tools to predict future individual scientific achievements. To address this need, Hirsch [17] indicated that the h-index performs significantly better than alternatives, including NC, T, and mean citations per paper, at forecasting future scientific achievement and productivity.
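Hirsch's definition quoted above can be illustrated with a short sketch. The function name `h_index` and the example citation counts are hypothetical and are not taken from the dataset of this study; the code only restates the definition (the largest m such that m papers have at least m citations each).

```python
def h_index(citations):
    """Return the largest m such that m papers have at least m citations each."""
    # Rank papers from most cited to least cited.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank      # the top `rank` papers all have at least `rank` citations
        else:
            break         # every later paper has even fewer citations
    return h


# Hypothetical researcher with five papers.
example_citations = [10, 8, 5, 4, 3]
example_h = h_index(example_citations)
print(example_h)
```

In this hypothetical example four papers have at least four citations each, while the fifth has only three, so the sketch yields h = 4.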

This research presents a statistical approach to modeling the h-index based on NC and D1. Simple and multiple nonlinear regressions and multiple nonlinear quartile regression were applied to estimate and predict the h-index based on NC and D1. The results were also compared with those of simple and multiple linear regression (SLR and MLR) models, which are the commonly used methods.

2. Methodology

This section is devoted to discussing the details of data collection, samples, and statistical techniques that have been applied to analyze the dataset.

2.1. Data Collection

The dataset of this work contains article information for 29470 Iranian scientists indexed in Google Scholar.

2.2. Data Analysis

Statistics and data mining are popular approaches to extract knowledge from a dataset. These approaches include different data analysis techniques such as descriptive statistics [18–22], regression models [23–29], time series models [30–43], and clustering analysis [44]. In this research, the data gathered from Google Scholar were analyzed using the SPSS 25 and R 3.3.2 software.

The descriptive statistics of the research variables, including the minimum, maximum, mean, standard deviation, and quartiles, are reported in Table 1. As Table 1 indicates, the means of h, NC, and D1 for Iranian scientists are 5.74, 248.78, and 7.98, respectively. Also, the value of h for at least 25%, 50%, and 75% of them is at most 2 (Q1 = 2), 4 (Q2 = 4), and 7 (Q3 = 7), respectively.

To determine the effect of each factor (NC and D1) on h, we applied a set of simple nonlinear regression models. Also, to investigate the simultaneous effects of NC and D1 on h, multiple nonlinear regression was applied. Finally, we divided the observations into 4 groups according to the quartiles of h: the first group up to Q1, the second between Q1 and Q2, the third between Q2 and Q3, and the fourth above Q3. Then, to model and estimate h based on NC and D1, multiple nonlinear quartile regression (MNLQR) was used. The goodness of fit of the applied models was evaluated by the coefficient of determination (R2) and by comparing actual values with predicted values. The accuracy of the models was investigated using five-fold cross-validation. In other words, the dataset was divided into five parts. In each step, four parts were used as training data and the remaining part as testing data. The models were trained on the training data and then applied to the testing data. Finally, the discrepancy between the predicted h and the true h was evaluated using different measures such as R2, root mean square error (RMSE), and mean absolute error (MAE).
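The five-fold cross-validation procedure and the three error measures described above can be sketched as follows. This is a minimal illustration rather than the code used in the study (which relied on SPSS and R): the helper names, the fixed random seed, the straight-line model used as a stand-in learner, and the synthetic data are all assumptions made for the example.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination (undefined if all y_true are equal)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def k_fold_indices(n, k=5, seed=0):
    """Shuffle the indices 0..n-1 and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x (stand-in for any model)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def cross_validate(x, y, k=5):
    """Train on k-1 folds, test on the held-out fold, and collect the measures."""
    scores = []
    for test_idx in k_fold_indices(len(x), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(x)) if i not in held_out]
        a, b = fit_line([x[i] for i in train_idx], [y[i] for i in train_idx])
        y_true = [y[i] for i in test_idx]
        y_pred = [a + b * x[i] for i in test_idx]
        scores.append({"R2": r2(y_true, y_pred),
                       "RMSE": rmse(y_true, y_pred),
                       "MAE": mae(y_true, y_pred)})
    return scores


# Synthetic, exactly linear data (not data from the study).
x_demo = [float(v) for v in range(1, 21)]
y_demo = [2.0 * v + 1.0 for v in x_demo]
cv_scores = cross_validate(x_demo, y_demo, k=5)
for fold, s in enumerate(cv_scores, start=1):
    print(fold, s)
```

Any of the regression models discussed in this paper could be substituted for `fit_line`; only the fitting and prediction steps inside `cross_validate` would change.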

2.2.1. Simple Nonlinear Regression

To model a quantitative response variable based on a single predictor variable, the simple nonlinear regression (SNLR) model is a powerful technique. The general SNLR equation expresses the response as a nonlinear function of the predictor, with three model parameters and a random error term. The estimated equation of the SNLR model has the same form, with the three parameters replaced by their estimates.
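As an illustration of fitting a three-parameter nonlinear model to one predictor, the sketch below assumes a power-law form, y = a + b·x^c. This functional form is an assumption made for the example and is not claimed to be the equation used in this study. The fit profiles over a grid of candidate exponents c and, for each c, solves for a and b by ordinary least squares in closed form; the function name `fit_power_model`, the grid, and the data are hypothetical.

```python
def fit_power_model(x, y, exponents):
    """Fit y = a + b * x**c by scanning c over `exponents`.

    For a fixed c the model is linear in (a, b), so those two parameters
    are obtained by closed-form least squares on z = x**c; the exponent
    with the smallest sum of squared errors is kept.
    """
    best = None
    for c in exponents:
        z = [xi ** c for xi in x]
        mz, my = sum(z) / len(z), sum(y) / len(y)
        szz = sum((zi - mz) ** 2 for zi in z)
        szy = sum((zi - mz) * (yi - my) for zi, yi in zip(z, y))
        b = szy / szz
        a = my - b * mz
        sse = sum((yi - (a + b * zi)) ** 2 for zi, yi in zip(z, y))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c


# Synthetic data generated from y = 0.5 + 1.2 * x**0.5 (not data from the study).
x_demo = [1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0, 64.0, 81.0, 100.0]
y_demo = [0.5 + 1.2 * xi ** 0.5 for xi in x_demo]

# Candidate exponents 0.10, 0.15, ..., 1.00 (an arbitrary illustrative grid).
grid = [i / 100 for i in range(10, 101, 5)]
a_hat, b_hat, c_hat = fit_power_model(x_demo, y_demo, grid)
print(a_hat, b_hat, c_hat)
```

A grid scan is used here only to keep the example dependency-free; in practice an iterative nonlinear least squares routine, such as those available in R or SPSS, would normally be used.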

2.2.2. Multiple Nonlinear Regression

To model a quantitative response variable based on several predictor variables, multiple nonlinear regression (MNLR) is a powerful technique. The general MNLR equation with two predictors expresses the response as a nonlinear function of both predictors, with a set of model parameters and a random error term. The estimated equation of the MNLR model has the same form, with the parameters replaced by their estimates, and gives the estimated value of the response.

2.2.3. Multiple Nonlinear Quartile Regression

In multiple nonlinear quartile regression (MNLQR), the quartiles of the response variable are computed first. Then, based on the values of the quartiles, the observations are divided into 4 distinct categories. Finally, a separate MNLR model is fitted for each category.
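The grouping step of MNLQR can be sketched as follows. The sketch covers only the split of observations by the quartiles of the response; fitting a separate model within each group is left out. The function name, the "at most" boundary convention, the inclusive quantile method, and the example values are assumptions made for illustration.

```python
import statistics


def quartile_groups(h_values):
    """Split observation indices into 4 groups by the quartiles of h.

    Assumed convention: group 1 has h <= Q1, group 2 has Q1 < h <= Q2,
    group 3 has Q2 < h <= Q3, and group 4 has h > Q3.
    """
    q1, q2, q3 = statistics.quantiles(h_values, n=4, method="inclusive")
    groups = {1: [], 2: [], 3: [], 4: []}
    for i, h in enumerate(h_values):
        if h <= q1:
            groups[1].append(i)
        elif h <= q2:
            groups[2].append(i)
        elif h <= q3:
            groups[3].append(i)
        else:
            groups[4].append(i)
    return (q1, q2, q3), groups


# Hypothetical h values for nine researchers (not data from the study).
h_demo = [4, 1, 15, 2, 7, 3, 9, 2, 5]
quartiles_demo, groups_demo = quartile_groups(h_demo)
print(quartiles_demo)
print(groups_demo)
```

Each list in `groups_demo` holds the positions of the observations in one category; a separate regression would then be fitted on the NC, D1, and h values at those positions.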

3. Results

Section 3.1 presents the SNLR results on the separate effect of each factor (NC and D1) on h. Section 3.2 presents the MNLR results on the simultaneous effects of NC and D1 on h. Section 3.3 reports the MNLQR results, which model the effects of the factors on h within each quartile.

3.1. SNLR Results

This part studies the impact of each factor (NC and D1) on h, with h as the response variable and NC and D1 as continuous predictors. Tables 2 and 3 summarize the results of the SNLR models for NC and D1. As Table 2 indicates, both NC and D1 had a significant effect on h (p < 0.001). Figure 1 shows the fitted curves together with the data.

Table 3 shows the parameter estimates of the SNLR models for NC and D1, respectively. These estimates give the fitted equations for h as a function of NC and of D1. The R2 values for these equations to estimate h were 93.4% and 39.8%, respectively.

Table 4 shows the results of SLR as a comparative method. The estimates in Table 4 give the corresponding linear equations for h as a function of NC and of D1. The R2 values for these equations to estimate h were 56.9% and 24.9%, respectively. As can be observed, the SNLR method performs better than the SLR method.

Table 5 summarizes the results of five-fold cross-validation. The results confirm that the SNLR method performs better than the SLR method.

Figure 2 and Table 6 show the plot of actual values versus predicted values and the correlations between them. As can be seen, the SNLR model based on NC had a better fit.

3.2. MNLR Results

This part studies the simultaneous impacts of NC and D1 on h. Tables 7 and 8 summarize the results of the MNLR model. As Table 7 indicates, the NC and D1 factors had a significant effect on h (p < 0.001). Table 8 shows the parameter estimates of the MNLR model.

The estimates in Table 8 give the fitted equation for h as a function of NC and D1. The R2 value for this equation to estimate h was 93.6%, which is not significantly greater than the 93.4% obtained with the SNLR model based on NC alone.

Table 9 shows the results of MLR as a comparative method. The estimates in Table 9 give the corresponding linear equation for h as a function of NC and D1. The R2 value for this equation to estimate h was 72.2%. As can be observed, the MNLR method performs better than the MLR method.

Table 10 summarizes the results of five-fold cross-validation. The results confirm that the MNLR method performs better than the MLR method.

Figure 3 and Table 11 show the plot of actual values versus predicted values and the correlations between them. As can be seen, the MNLR model estimates the values of h well.

3.3. MNLQR Results

This part studies the simultaneous impacts of NC and D1 on h within the different quartiles of h. Based on the results of Table 12, the NC and D1 factors had a significant effect on h (p < 0.001) in every category. Based on these results, a separate equation for estimating h as a function of NC and D1 was obtained for each of categories 1 to 4.

4. Conclusion

This research presented a statistical approach to modeling the h-index (h) based on the total number of citations (NC) and the duration since the publication of the first article (D1). To determine the effect of each factor (NC and D1) on h, we applied a set of simple nonlinear regression models. The results indicated that both NC and D1 had a significant effect on h (p < 0.001) and yielded fitted equations for h as a function of NC and of D1. The R2 values of these equations to estimate h were 93.4% and 39.8%, respectively, which verified that the model based on NC had a better fit.

Then, to capture the simultaneous effects of NC and D1 on h, multiple nonlinear regression was applied. The results indicated that NC and D1 had a significant effect on h (p < 0.001) and yielded a fitted equation for h as a function of NC and D1. The R2 value of this equation to estimate h was 93.6%, which was not significantly greater than the 93.4% obtained with NC alone.

Finally, to model and estimate h as a function of NC and D1, multiple nonlinear quartile regression was used. The goodness of fit of the model was also assessed.

As an important result, since the h-index is significantly affected by D1, it is suggested that the h-index be adjusted for D1 or for the times at which the papers were published. Moreover, because previous studies have verified the impact of the number of authors (NA) of the papers on the h-index, it is also suggested that the h-index be adjusted for NA. As a direction for future work, the authors suggest defining a new measure of researcher productivity in terms of the number of papers n and, for each paper i, the time when it was published (in years), its number of citations, and its number of authors.

Data Availability

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.