A Statistical Approach to Model the H-Index Based on the Total Number of Citations and the Duration from the Publishing of the First Article

Mahmoudi, Mohammad Reza; Rahmati, Marzieh; Mansor, Zulkefli; Mosavi, Amirhosein; Band, Shahab S.

doi:https://doi.org/10.1155/2021/6351836

Complexity


Research Article | Open Access

Volume 2021 | Article ID 6351836 | https://doi.org/10.1155/2021/6351836


Mohammad Reza Mahmoudi,¹ Marzieh Rahmati,² Zulkefli Mansor,³ Amirhosein Mosavi,⁴˒⁵ and Shahab S. Band⁶˒⁷

Academic Editor: Diego R. Amancio

Received 06 May 2020

Revised 24 Dec 2020

Accepted 19 Feb 2021

Published 01 Mar 2021

Abstract

The productivity of researchers and the impact of their work are a preoccupation of universities, research funding agencies, and sometimes even researchers themselves. The h-index (h) is the most popular of the metrics used to measure them. This research presents a practical approach to modeling the h-index based on the total number of citations (N_C) and the duration since the publication of the first article (D₁). To determine the effect of each factor (N_C and D₁) on h, we applied a set of simple nonlinear regressions. The results indicated that both N_C and D₁ had a significant effect on h (p < 0.001). The coefficients of determination of these equations for estimating the h-index were 93.4% and 39.8%, respectively, which verified that the model based on N_C had a better fit. Then, to capture the simultaneous effects of N_C and D₁ on h, multiple nonlinear regression was applied. The results indicated that N_C and D₁ had a significant effect on h (p < 0.001), and the coefficient of determination of this equation was 93.6%. Finally, to model and estimate the h-index as a function of N_C and D₁, multiple nonlinear quartile regression was used. The goodness of fit of the fitted model was also assessed.

1. Introduction

The productivity of researchers and the impact of their work are a preoccupation of universities, research funding agencies, and sometimes even researchers themselves. Several metrics have been used to measure these, including journal impact factors, citation counts, and publication rates. At present, however, the h-index is the most popular of these metrics [1–4]. Hirsch’s definition of the index is that h = m if m of a researcher’s p papers have at least m citations each and each of the other papers has no more than m citations. As a guide, Hirsch [1] suggested that a “successful” scientist would have h = 20 after 20 years of work, whereas outstanding and “truly unique” individuals would have h = 40 and h = 60, respectively, after 20 years of work. Subsequent work has shown that this is too great a generalisation, if only because the h-index is highly discipline-specific and depends on circumstances such as the comprehensiveness of the literature databases used to calculate the index, among many others [5, 6]. For example, very eminent mathematicians often have h < 10, and some Nobel laureates also have very small h-indices [7]. The inevitable inference is that an individual’s h-index should be considered in the context of these factors and of the distribution of h for a given number of papers and citations appropriate to the individual researcher. Some researchers have introduced alternative versions of h [8]. Generally, all of these indices consider the number of citations received by articles. Recently, researchers have studied and developed theoretical models to estimate these indices from other indicators, for example, from N_C [1]; from the total number of publications, T [9]; from the total number of publications with at least one citation, T₁ [10]; from N_C and T [4, 11–15]; and from N_C, T₁, and the total number of citations for the 1 most cited papers, C₁ [16].
Librarians are particularly interested in good tools for predicting future individual scientific achievement. Addressing this problem, Hirsch [17] indicated that the h-index performs significantly better than alternatives such as N_C, T, and mean citations per paper in forecasting future scientific achievement and productivity.
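
Hirsch’s definition of the index, given earlier in this section, translates directly into a short computation. The following is a minimal Python sketch, not code from the study; the function name `h_index` and the sample citation counts are hypothetical.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    # Rank papers from most cited to least cited.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


if __name__ == "__main__":
    # Hypothetical citation counts for one researcher's papers.
    print(h_index([10, 8, 5, 4, 3]))
```

Sorting first keeps the logic close to the verbal definition: the index is the last rank at which the citation count still reaches the rank.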

This research presents a statistical approach to modeling the h-index based on N_C and D₁. Simple and multiple nonlinear regressions and multiple nonlinear quartile regression were applied to estimate and predict the h-index from N_C and D₁. The results were also compared with those of simple and multiple linear regression (SLR and MLR) models, which are common methods.

2. Methodology

This section is devoted to discussing the details of data collection, samples, and statistical techniques that have been applied to analyze the dataset.

2.1. Data Collection

The dataset of this work contains information on the articles of 29470 Iranian scientists who are indexed in Google Scholar.

2.2. Data Analysis

Statistics and data mining are popular approaches for extracting knowledge from datasets. These approaches include different data analysis techniques such as descriptive statistics [18–22], regression models [23–29], time series models [30–43], and clustering analysis [44]. In this research, the data gathered from Google Scholar were analyzed using SPSS 25 and R 3.3.2.

The descriptive statistics of the research variables, including the minimum, maximum, mean, standard deviation, and quartiles, are reported in Table 1. As Table 1 indicates, the means of h, N_C, and D₁ for Iranian scientists are 5.74, 248.78, and 7.98, respectively. Also, at least 25%, 50%, and 75% of the scientists have an h of at most 2 (Q₁ = 2), 4 (Q₂ = 4), and 7 (Q₃ = 7), respectively.

To determine the effect of each factor (N_C and D₁) on h, we applied a set of simple nonlinear regressions. Also, to investigate the simultaneous effects of N_C and D₁ on h, multiple nonlinear regression was applied. Finally, we divided the observations into 4 groups according to the quartiles of h as follows: the first group, observations with h ≤ Q₁; the second group, observations with Q₁ < h ≤ Q₂; the third group, observations with Q₂ < h ≤ Q₃; and the fourth group, observations with h > Q₃. Then, to model and estimate h based on N_C and D₁, multiple nonlinear quartile regression (MNLQR) was used. The goodness of fit of the applied models was evaluated by the coefficient of determination (R²) and by comparing actual values with predicted values. The accuracy of the models was investigated using five-fold cross-validation. In other words, the dataset was divided into five parts. In each step, four parts were used as training data and the remaining part was used as testing data. The models were trained on the training data and the trained models were applied to the testing data. Finally, the discrepancy between the predicted h and the true h was evaluated using different measures such as R², root mean square error (RMSE), and mean absolute error (MAE).
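
The five-fold cross-validation procedure and the three error measures described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the SPSS or R code used in the study: the ordinary least-squares line `fit_line` merely stands in for whichever regression model is being evaluated, and all function names and the shuffling seed are hypothetical.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination (undefined if all y_true are equal)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_line(x, y):
    """Stand-in model (assumption): least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def k_fold_scores(x, y, k=5, seed=0):
    """Train on k-1 parts, test on the held-out part, once per part."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint parts
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        a, b = fit_line([x[i] for i in train_idx], [y[i] for i in train_idx])
        y_true = [y[i] for i in test_idx]
        y_pred = [a + b * x[i] for i in test_idx]
        scores.append({"R2": r_squared(y_true, y_pred),
                       "RMSE": rmse(y_true, y_pred),
                       "MAE": mae(y_true, y_pred)})
    return scores
```

Averaging each measure over the five folds gives one summary value per model, which is a common way to compare competing models on held-out data.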

2.2.1. Simple Nonlinear Regression

The simple nonlinear regression (SNLR) model is a powerful technique for modeling a quantitative response variable Y based on a single predictor variable X. The general equation of SNLR can be written as Y = f(X; β₀, β₁, β₂) + ε, where β₀, β₁, and β₂ are model parameters and ε is the random error. The estimated equation of the SNLR model is given by Ŷ = f(X; β̂₀, β̂₁, β̂₂), where β̂₀, β̂₁, and β̂₂ are the estimates of β₀, β₁, and β₂, respectively.
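
As an illustration of fitting a simple nonlinear regression by least squares, the sketch below fits a power-law curve y ≈ a·x^b with the Gauss-Newton method. The power-law form, the function name `fit_power_law`, and the synthetic data are assumptions chosen for illustration; they are not the model equation or the data of this study.

```python
import math


def fit_power_law(x, y, iterations=50, tol=1e-12):
    """Least-squares fit of y = a * x**b (assumed form) by Gauss-Newton.

    Requires strictly positive x and y, because the starting values come
    from a straight-line fit on the log-log scale.
    """
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(x)
    mx, my = sum(lx) / n, sum(ly) / n
    b = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
    a = math.exp(my - b * mx)

    for _ in range(iterations):
        # Residuals and the two Jacobian columns (d/da and d/db).
        r = [yi - a * xi ** b for xi, yi in zip(x, y)]
        ja = [xi ** b for xi in x]
        jb = [a * xi ** b * math.log(xi) for xi in x]
        # Solve the 2x2 normal equations (J^T J) delta = J^T r.
        saa = sum(u * u for u in ja)
        sab = sum(u * v for u, v in zip(ja, jb))
        sbb = sum(v * v for v in jb)
        ga = sum(u * ri for u, ri in zip(ja, r))
        gb = sum(v * ri for v, ri in zip(jb, r))
        det = saa * sbb - sab * sab
        if det == 0:
            break
        da = (sbb * ga - sab * gb) / det
        db = (saa * gb - sab * ga) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break
    return a, b


if __name__ == "__main__":
    # Synthetic, noise-free data generated from a = 1.5, b = 0.5.
    xs = [1, 2, 5, 10, 50, 100, 500, 1000]
    ys = [1.5 * v ** 0.5 for v in xs]
    print(fit_power_law(xs, ys))
```

The plain Gauss-Newton step used here has no damping, so on noisy data a statistics package's nonlinear least-squares routine would be the more robust choice.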

2.2.2. Multiple Nonlinear Regression

Multiple nonlinear regression (MNLR) is a powerful technique for modeling a quantitative response variable Y based on several predictor variables. The general equation of MNLR with two predictors X₁ and X₂ can be written as Y = f(X₁, X₂; β₀, …, β_k) + ε, where β₀, …, β_k are model parameters and ε is the random error. The estimated equation of the MNLR model is given by Ŷ = f(X₁, X₂; β̂₀, …, β̂_k), where β̂₀, …, β̂_k are the estimates of β₀, …, β_k and Ŷ is the estimated value of Y.

2.2.3. Multiple Nonlinear Quartile Regression

In multiple nonlinear quartile regression (MNLQR), the quartiles of the response variable are computed first. Then, based on the quartile values, the observations are categorized into 4 distinct categories. Finally, a separate MNLR model is fitted for each category.
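
The grouping step of MNLQR can be sketched as follows. This is an illustrative Python sketch; the boundary convention (a value equal to a quartile falls in the lower group), the use of `statistics.quantiles` with the inclusive method, and all function names are assumptions rather than details confirmed by the study.

```python
import statistics


def assign_group(h, q1, q2, q3):
    """Return the quartile group (1 to 4) of a response value h."""
    if h <= q1:
        return 1
    if h <= q2:
        return 2
    if h <= q3:
        return 3
    return 4


def split_by_quartile(records):
    """Group (h, n_c, d1) records by the quartiles of h.

    Returns the three cut points and a dict mapping group number to records.
    """
    h_values = [rec[0] for rec in records]
    q1, q2, q3 = statistics.quantiles(h_values, n=4, method="inclusive")
    groups = {1: [], 2: [], 3: [], 4: []}
    for rec in records:
        groups[assign_group(rec[0], q1, q2, q3)].append(rec)
    return (q1, q2, q3), groups


if __name__ == "__main__":
    # Hypothetical (h, N_C, D1) records, not data from the study.
    sample = [(1, 3, 2), (2, 10, 3), (3, 30, 4), (4, 60, 5),
              (5, 90, 6), (6, 150, 8), (7, 210, 9), (8, 300, 12)]
    cuts, groups = split_by_quartile(sample)
    for g in sorted(groups):
        # A separate regression model would be fitted to each group here.
        print(g, len(groups[g]))
```

Because h takes integer values with many ties, the groups in real data need not be of equal size; the cut points only fix which observations each per-group model sees.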

3. Results

The SNLR results on the separate effects of each factor (N_C and D₁) on h are given in Section 3.1. Section 3.2 presents the MNLR results on the simultaneous effects of N_C and D₁ on h. Section 3.3 reports the MNLQR results, which model the effects of the factors on h in each quartile.

3.1. SNLR Results

This part studies the impact of each factor (N_C and D₁) on h. In this research, h was the response variable, and N_C and D₁ were continuous predictors. Tables 2 and 3 summarize the results of the SNLR models for N_C and D₁. As Table 2 indicates, both N_C and D₁ had a significant effect on h (p < 0.001). Figure 1 shows the fitted curves together with the data.

Figure 1: Fitted SNLR curves with the data, panels (a) and (b).

Table 3 shows the parameter estimates of the SNLR models for N_C and D₁. Based on the results of Table 3, h can be estimated as a function of N_C and as a function of D₁ from the corresponding fitted equations. The R² values of these equations were 93.4% and 39.8%, respectively.

Table 4 shows the results of SLR as a comparative method. Based on the results of Table 4, h can likewise be estimated as a linear function of N_C and as a linear function of D₁. The R² values of these equations were 56.9% and 24.9%, respectively. As can be observed, the SNLR method performs better than the SLR method.

Table 5 summarizes the results of five-fold cross-validation. The results confirm that the SNLR method performs better than the SLR method.

Figure 2 and Table 6 show the plot of actual values versus predicted values and the correlations between them. As can be seen, the SNLR model based on N_C had a better fit.

Figure 2: Actual versus predicted values for the SNLR models, panels (a) and (b).

3.2. MNLR Results

This part studies the simultaneous impacts of N_C and D₁ on h. Tables 7 and 8 summarize the results of the MNLR model. As Table 7 indicates, N_C and D₁ had a significant effect on h (p < 0.001). Table 8 shows the parameter estimates of the MNLR model.

Based on the results of Table 8, h can be estimated as a function of N_C and D₁ from the fitted MNLR equation. The R² value of this equation was 93.6%, which is not significantly greater than the 93.4% obtained by the SNLR model based on N_C alone.

Table 9 shows the results of MLR as a comparative method. Based on the results of Table 9, h can be estimated as a linear function of N_C and D₁. The R² value of this equation was 72.2%. As can be observed, the MNLR method performs better than the MLR method.

Table 10 summarizes the results of five-fold cross-validation. The results confirm that the MNLR method performs better than the MLR method.

Figure 3 and Table 11 show the plot of actual values versus predicted values and the correlations between them. As can be seen, the MNLR model can nicely estimate the values of h.

3.3. MNLQR Results

This part studies the simultaneous impacts of N_C and D₁ on the different quartiles of h. Based on the results of Table 12, we can conclude that N_C and D₁ had a significant effect on h (p < 0.001) in every category. Based on these results, h can be estimated as a function of N_C and D₁ by a separate fitted equation in each of categories 1 to 4.

4. Conclusion

This research presented a statistical approach to modeling the h-index (h) based on the total number of citations (N_C) and the duration since the publication of the first article (D₁). To determine the effect of each factor (N_C and D₁) on h, we applied a set of simple nonlinear regressions. The results indicated that both N_C and D₁ had a significant effect on h (p < 0.001) and that h can be estimated as a function of N_C and as a function of D₁ from the corresponding fitted equations. The R² values of these equations were 93.4% and 39.8%, respectively, which verified that the model based on N_C had a better fit.

Then, to capture the simultaneous effects of N_C and D₁ on h, multiple nonlinear regression was applied. The results indicated that N_C and D₁ had a significant effect on h (p < 0.001) and that h can be estimated as a function of N_C and D₁ from the fitted MNLR equation. The R² value of this equation was 93.6%, which was not significantly greater than the 93.4% of the SNLR model based on N_C.

Finally, to model and estimate h, as a function of N_C and D₁, the multiple nonlinear quartile regression was used. The goodness of the fitted model was also assessed.

As an important result, since the h-index is significantly affected by D₁, we suggest adjusting the h-index for D₁ or for the times at which the papers were published. Moreover, because previous studies have verified the impact of the number of authors (N_A) of the papers on the h-index, we also suggest adjusting the h-index for N_A. As a path for future work, the authors suggest defining a new measure of researcher productivity that combines, for each paper i of a researcher’s n papers, the time at which paper i was published (in years), the number of citations of paper i, and the number of authors of paper i.

Data Availability

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] J. E. Hirsch, “An index to quantify an individual’s scientific research output,” Proceedings of the National Academy of Sciences, vol. 102, no. 46, pp. 16569–16572, 2005.
[2] T. Braun, W. Glänzel, and A. Schubert, “A Hirsch-type index for journals,” Scientometrics, vol. 69, no. 1, pp. 169–173, 2006.
[3] A.-W. Harzing and R. van der Wal, “A Google Scholar h-index for journals: an alternative metric to measure journal impact in economics and business,” Journal of the American Society for Information Science and Technology, vol. 60, no. 1, pp. 41–46, 2009.
[4] A. Schubert and W. Glänzel, “A systematic analysis of Hirsch-type indices for journals,” Journal of Informetrics, vol. 1, no. 3, pp. 179–184, 2007.
[5] P. Vinkler, “Eminence of scientists in the light of the h-index and other scientometric indicators,” Journal of Information Science, vol. 33, no. 4, pp. 481–491, 2007.
[6] S. Ruch and R. Ball, “Various correlations between the H-Index and citation rate (CPP) in neuroscience and quantum physics: new findings,” International Journal of Information Science and Management, vol. 8, no. 1, pp. 1–19, 2010.
[7] A. Yong, “A critique of Hirsch’s citation index: a combinatorial Fermi problem,” Notices of the American Mathematical Society, vol. 61, no. 9, pp. 1040–1050, 2014.
[8] J. Bar-Ilan, “Rankings of information and library science journals by JIF and by h-type indices,” Journal of Informetrics, vol. 4, no. 2, pp. 141–147, 2010.
[9] L. Egghe and R. Rousseau, “An informetric model for the Hirsch-index,” Scientometrics, vol. 69, no. 1, pp. 121–129, 2006.
[10] Q. L. Burrell, “Formulae for the h-index: a lack of robustness in Lotkaian informetrics?” Journal of the American Society for Information Science and Technology, vol. 64, no. 7, pp. 1504–1514, 2013.
[11] W. Glänzel, “On the h-index-a mathematical approach to a new measure of publication activity and citation impact,” Scientometrics, vol. 67, no. 2, pp. 315–321, 2006.
[12] J. E. Iglesias and C. Pecharromán, “Scaling the h-index for different scientific ISI fields,” Scientometrics, vol. 73, no. 3, pp. 303–320, 2007.
[13] L. Egghe, L. Liang, and R. Rousseau, “A relation between h-index and impact factor in the power-law model,” Journal of the American Society for Information Science and Technology, vol. 60, no. 11, pp. 2362–2365, 2009.
[14] A. Bletsas and J. N. Sahalos, “Hirsch index rankings require scaling and higher moment,” Journal of the American Society for Information Science and Technology, vol. 60, no. 12, pp. 2577–2586, 2009.
[15] L. Egghe and R. Rousseau, “The Hirsch index of a shifted Lotka function and its relation with the impact factor,” Journal of the American Society for Information Science and Technology, vol. 63, no. 5, pp. 1048–1053, 2012.
[16] L. Bertoli-Barsotti and T. Lando, “On a formula for the h-index,” Journal of Informetrics, vol. 9, no. 4, pp. 762–776, 2015.
[17] J. E. Hirsch, “Does the h index have predictive power?” Proceedings of the National Academy of Sciences, vol. 104, no. 49, p. 19193, 2007.
[18] H. Haghbin, M. R. Mahmoudi, and Z. Shishebor, “Large sample inference on the ratio of two independent binomial proportions,” Journal of Mathematical Extension, vol. 5, no. 1, pp. 87–95, 2011.
[19] M. R. Mahmoudi, J. Behboodian, and M. Maleki, “Large sample inference about the ratio of means in two independent populations,” Journal of Statistical Theory and Applications, vol. 16, no. 3, pp. 366–374, 2017.
[20] M. R. Mahmoudi and M. Mahmoodi, “Inferrence on the ratio of variances of two independent populations,” Journal of Mathematical Extension, vol. 7, no. 2, pp. 83–91, 2014.
[21] M. R. Mahmoudi and M. Mahmoodi, “Inferrence on the ratio of correlations of two independent populations,” Journal of Mathematical Extension, vol. 7, no. 4, pp. 71–82, 2014.
[22] M. R. Mahmoudi, R. Nasirzadeh, and M. Mohammadi, “On the ratio of two independent skewnesses,” Communications in Statistics-Theory and Methods, vol. 48, no. 7, pp. 1721–1727, 2019.
[23] M. R. Mahmoudi, M. Mahmoudi, and E. Nahavandi, “Testing the difference between two independent regression models,” Communications in Statistics-Theory and Methods, vol. 45, no. 21, pp. 6284–6289, 2016.
[24] M. R. Mahmoudi, M. Maleki, and A. Pak, “Testing the equality of two independent regression models,” Communications in Statistics-Theory and Methods, vol. 47, no. 12, pp. 2919–2926, 2018.
[25] M. R. Mahmoudi, “On comparing two dependent linear and nonlinear regression models,” Journal of Testing and Evaluation, vol. 47, no. 1, pp. 449–458, 2018.
[26] P. Ji-jun, M. R. Mahmoudi, D. Baleanu, and M. Maleki, “On comparing and classifying several independent linear and non-linear regression models with symmetric errors,” Symmetry, vol. 11, no. 6, p. 820, 2019.
[27] M. Bahrami, M. J. Amiri, M. R. Mahmoudi, and S. Koochaki, “Modeling caffeine adsorption by multi-walled carbon nanotubes using multiple polynomial regression with interaction effects,” Journal of Water and Health, vol. 15, no. 4, pp. 526–535, 2017.
[28] M. R. Mahmoudi, M. H. Heydari, and K.-H. Pho, “Fuzzy clustering to classify several regression models with fractional Brownian motion errors,” Alexandria Engineering Journal, vol. 59, no. 4, pp. 2811–2818, 2020.
[29] M. R. Mahmoudi, M. Mahmoudi, and A. Pak, “On comparing, classifying and clustering several dependent regression models,” Journal of Statistical Computation and Simulation, vol. 89, no. 12, pp. 2280–2292, 2019.
[30] M. R. Mahmoudi, M. Maleki, and A. Pak, “Testing the difference between two independent time series models,” Iranian Journal of Science and Technology, Transactions A: Science, vol. 41, no. 3, pp. 665–669, 2017.
[31] M. R. Mahmoudi, M. H. Heydari, and R. Roohi, “A new method to compare the spectral densities of two independent periodically correlated time series,” Mathematics and Computers in Simulation, vol. 160, pp. 103–110, 2019.
[32] M. R. Mahmoudi, M. H. Heydari, and Z. Avazzadeh, “Testing the difference between spectral densities of two independent periodically correlated (cyclostationary) time series models,” Communications in Statistics-Theory and Methods, vol. 48, no. 9, pp. 2320–2328, 2019.
[33] M. H. Heydari, Z. Avazzadeh, and M. R. Mahmoudi, “Chebyshev cardinal wavelets for nonlinear stochastic differential equations driven with variable-order fractional Brownian motion,” Chaos, Solitons & Fractals, vol. 124, pp. 105–124, 2019.
[34] M. R. Mahmoudi and M. Maleki, “A new method to detect periodically correlated structure,” Computational Statistics, vol. 32, no. 4, pp. 1569–1581, 2017.
[35] A. R. Nematollahi, A. R. Soltani, and M. R. Mahmoudi, “Periodically correlated modeling by means of the periodograms asymptotic distributions,” Statistical Papers, vol. 58, no. 4, pp. 1267–1278, 2017.
[36] M. R. Mahmoudi, M. H. Heydari, and Z. Avazzadeh, “On the asymptotic distribution for the periodograms of almost periodically correlated (cyclostationary) processes,” Digital Signal Processing, vol. 81, pp. 186–197, 2018.
[37] M. R. Mahmoudi, M. H. Heydari, Z. Avazzadeh, and K.-H. Pho, “Goodness of fit test for almost cyclostationary processes,” Digital Signal Processing, vol. 96, p. 102597, 2020.
[38] M. R. Mahmoudi, M. Maleki, K. Borodin, K.-H. Pho, and D. Baleanu, “On comparing and clustering the spectral densities of several almost cyclostationary processes,” Alexandria Engineering Journal, vol. 59, no. 4, pp. 2555–2565, 2020.
[39] R. Zhou, M. R. Mahmoudi, S. N. Q. Mohammed, and K.-H. Pho, “Testing the equality of the spectral densities of several uncorrelated almost cyclostationary processes,” Alexandria Engineering Journal, vol. 59, no. 5, pp. 3545–3550, 2020.
[40] M. R. Mahmoudi, D. Baleanu, B. Anh Tuan, and K.-H. Pho, “A novel method to detect almost cyclostationary structure,” Alexandria Engineering Journal, vol. 59, no. 4, pp. 2339–2346, 2020.
[41] R. Roohi, M. H. Heydari, M. Aslami, and M. R. Mahmoudi, “A comprehensive numerical study of space-time fractional bioheat equation using fractional-order Legendre functions,” The European Physical Journal Plus, vol. 133, p. 412, 2018.
[42] M. R. Mahmoudi, A. R. Nematollahi, and A. R. Soltani, “On the detection and estimation of the simple harmonizable processes,” Iranian Journal of Science and Technology (Sciences), vol. 39, no. 2, pp. 239–242, 2015.
[43] M. Maleki, M. R. Mahmoudi, D. Wraith, and K. H. Pho, “Time series modelling to forecast the confirmed and recovered cases of COVID-19,” Travel Medicine and Infectious Disease, vol. 37, p. 101742, 2020.
[44] A. R. Abbasi, M. R. Mahmoudi, and Z. Avazzadeh, “Diagnosis and clustering of power transformer winding fault types by cross-correlation and clustering analysis of FRA results,” IET Generation, Transmission & Distribution, vol. 12, no. 19, pp. 4301–4309, 2018.

Copyright

Copyright © 2021 Mohammad Reza Mahmoudi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
