Relevant SARS-CoV-2 Genome Variation through Six Months of Worldwide Monitoring
Real-time genome monitoring of the SARS-CoV-2 pandemic outbreak is of utmost importance for designing diagnostic tools, guiding antiviral treatment and vaccination strategies. In this study, we present an accurate method for temporal and geographical comparison of mutational events based on GISAID database genome sequencing. Among 42523 SARS-CoV-2 genomes analyzed, we found 23202 variants compared to the reference genome. The Ti/Tv (transition/transversion) ratio was used to filter out possible false-positive errors. Transition mutations generally occurred more frequently than transversions. Our clustering analysis revealed remarkable hotspot mutation patterns for SARS-CoV-2. Mutations were clustered based on how their frequencies changed over time according to each geographical location. We observed some clusters showing a clear variation in mutation frequency and continuously evolving in the world. However, many mutations appeared in specific periods without a clear pattern over time. Various important nonsynonymous mutations were observed, mainly in Oceania and Asia. More than half of these mutations were observed only once. Four hotspot mutations were found in all geographical locations at least once: T265I (NSP2), P314L (NSP12), D614G (S), and Q57H (ORF3a). The current analysis of SARS-CoV-2 genomes provides valuable information on the geographical and temporal mutational evolution of SARS-CoV-2.
The global pandemic of coronavirus disease (COVID-19) is caused by severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) . The first genome sequencing of the virus was performed in humans on 07 January 2020 by Chinese scientists . The genome of the virus contains 14 ORFs (open reading frames) encoding 27 proteins . Approximately two-thirds of the viral RNA encode for 16 nonstructure proteins (NSPs) and are primarily located in the first ORF (ORF1a/b) of the viral genome. The remaining viral genome, located in the 3-terminus of the SARS-CoV-2 genome, encodes for four essential structural proteins (spike (S) glycoprotein, small envelope (E), matrix (M), and nucleocapsid (N) proteins) [3, 4] and eight accessory proteins (3a, 3b, p6, 7a, 7b, 8b, 9b, and ORF) [3, 5]. It has been widely reported that the functional domains of the viral genome are linked to virulence, infectivity, ion channel formation, and virus release and that conserved functional domains across the species (likely 3a protein) are critical in the viral life cycle .
Current genome survey data suggests that single nucleotide variants (SNVs) are abundant, and a plentiful mutational diversity may define different major clusters of viruses now circulating globally . Such observations raised the question of whether viral mutations are contributing to host tropism, changing antigenicity, and rapid global spread . A recent study reported that SARS-CoV-2 has mutated in different patients, leading to the viral genome grouping into 6 genotypes . In fact, the virus mutation rate drives viral evolution and genome variability .
The Ti/Tv ratio is a tool to measure the nucleotide mutation process and is an established component of substitution models . The Ti/Tv ratio is computed as the ratio of the number of transition SNPs (purine-to-purine or pyrimidine-to-pyrimidine) to the number of transversion SNPs (purine-to-pyrimidine or vice versa). Estimation of the Ti/Tv ratio gives us a deeper understanding of the process of molecular evolution, which can help in modeling of the evolutionary process. The Ti/Tv ratio is also widely used [11, 12] as a quality control parameter for assessing the overall SNP quality.
Overall, close monitoring of the SARS-CoV-2 genome is important for identifying possible mutant strains and the potential impact of these mutations on the virulence and propagation of the virus between populations worldwide . Genome efforts to track SARS-CoV-2 sequences have globally helped identify worrying variants.
Since December 2019, the GISAID database (https://www.gisaid.org) has provided a compilation of thousands of complete and partial SARS-CoV-2 genomes, contributed by clinicians and researchers from across the world. These sequencing data are useful in tracking intercontinental variation and the interpersonal evolutionary dynamics of the virus. Assessment of the virus sequence can provide scientific evidence to guide drug and vaccine development.
In the present investigation, we compared the mutational events of SARS-CoV-2 in six geographical areas (Africa, Asia, Europe, North America, Oceania, and South America) based on viral genome data available in the GISAID database, collected within a period of 6 months, from January to June 2020.
2. Materials and Methods
2.1. Source of Sequences and Exclusion Criteria
Genome sequences of SARS-CoV-2 were retrieved from the GISAID database (https://www.gisaid.org/) separately for six geographic locations (Africa, Asia, Europe, North America, Oceania, and South America) and for six periods of genome collection (01-31 January 2020, 01-29 February, 01-31 March, 01-30 April, 01-31 May, and 01-30 June). A full acknowledgment table of those labs which generated and uploaded data is provided as a supplementary file (available here). The sequences were filtered according to the following criteria: human host, complete genome (only genomes with >92000 pb), and high coverage (genomes containing less than 1% Ns and less than 0.05% of unique amino acid mutations without insertions or deletions). Only high-quality genomes with precise collection times were selected.
2.2. Mapping and SNP Calling
The SNP profile is a collection of SNP mutations with the corresponding nucleotide changes and positions in a genome. For each dataset, whole-genome sequences in FASTA format were mapped to reference genome MN908947.3, using the standard Geneious algorithm (Geneious Prime software 2020.2.1). To preserve sequences’ site information, we trimmed end positions 1 to 55 and 29804 to 29903 as they are characterized by low coverage and a high rate of apparent sequencing or mapping errors. The “find variations/SNPs” tool in the Geneious software was used for SNP calling using reference genome NC045512.2 annotations obtained from the NCBI database. The data files in the CSV format were exported to an Excel file for further analysis.
2.3. Statistical Data Analysis
All statistical analyses and generation of plots were performed via RStudio software (version 1.3.959) using various package tools. Categorical variables were expressed as absolute frequencies and percentages. The nonparametric Wilcoxon test was used to compare the distribution of variant frequencies between the transition and transversion mutations, and the chi-squared test was used to compare the proportions of transition and transversion mutations between locations for each period and between periods for each location. A value of less than 0.05 was considered statistically significant.
3.1. Variant Filtration
A total of 42523 SARS-CoV-2 genomes derived from six geographical locations were fully analyzed from January 2020 to June 2020 (Table 1). The analysis showed a total of 23202 variants, which upon comparison with the reference genome showed 55% as transition mutations and 45% as transversion mutations. Upon analyzing all the variants, we observed that the Ti/Tv ratio was very low () (Table 1). However, as we increased the cutoff value of variant frequencies, this ratio increased. At a cutoff of 1%, the Ti/Tv ratio became 3.2. At this level, only 55 variants were detected (79% were transition mutations, and 21% were transversion mutations). For the next step of our analysis, we considered only variants observed with a frequency over 1% in each dataset.
3.2. Characterization of SARS-CoV-2 Variants Geographically over Time
To understand the evolution of SARS-CoV-2 over the first six months of the pandemic, we analyzed data separately for each geographical location by categorizing the data according to the timing of sample collection as recorded in the GISAID database. As shown in Table 2, the largest proportion of genomes came from Europe (24326 samples, 57.2%), followed by North America (10878 samples, 25.6%), Asia (4305 sample, 10.1%), and Oceania (1904 samples, 4.5%), while genome sequences from Africa and South America accounted for only 2.6%, with 569 and 541 samples, respectively.
By focusing on hotspot mutations with frequencies over 1% in each dataset to filter out possible recurrent sequencing errors, we observed that transition mutations generally occurred more frequently than transversion mutations. Comparison of mutational characteristics of viral genomes from various geographical locations (Table 2) showed that the Ti/Tv ratios ranged from 2 to 5.5 and that the most common nucleotide change was the C>U transition (data not shown). The proportion of the latter ranged usually from 40% to 59%. The most common transversion was G>U whose proportion ranged from 9.23% to 27%. When comparing the distribution of Ti and Tv mutations for each geographical location at different time periods, we noticed no significant difference.
We then checked if the distribution of Ti and Tv mutations is associated with an increase in variant frequencies (Figure 1) and noted that there were more transversion mutations with low frequencies than with high frequencies. The Wilcoxon test showed a significant difference in the distribution of mutation frequencies between the Ti and Tv mutations in Asia in the February 2020 period with a value of 0.0132. A significant difference was also observed in Oceania in May 2020 (). For the rest of the datasets, there was no significant difference between the Ti and Tv mutations according to mutant frequencies.
3.3. Mutational Patterns of SARS-CoV-2
To study the mutational patterns of SARS-CoV-2 across geographical locations, we selected the most abundant mutations defined as having a frequency of 10% or greater in at least one dataset. Heatmap analysis was performed to visualize the shared groups of mutations and to cluster them based on their frequency changes in chronological order (Figure 2). We found that four mutations (A23403G, C14408T, C241T, and C3037T) always cooccurred and showed the same clustering pattern in all geographical locations over time.
Our clustering analysis revealed that each geographical location was characterized by different mutational patterns and some of them were continuously evolving in the world. In Africa, only two groups can be clearly distinguished: the group of four mutations cited above and one group of 12 mutations without a clear temporal pattern. In Asia, the number of hotspot mutations was relatively low during January and February, but after March, more mutations were detected and more patterns were observed, with 4 groups of mutations. Interestingly, we noticed a cooccurrence of two mutations (C8782T and T28144C) that had appeared early in Asia and their frequencies declined thereafter. In Europe, two mutations clustered together (G11083T and G26144T) appeared with relatively higher frequency during the initial periods as compared to the later periods. We noted that the variant GGG mutated to AAC at position 28881–28883 and reached a frequency of more than 10% in March that increased over time. We identified another group of 5 mutations (C1059T, C14805T, G25563T, G10097A, and C2373T) without a clear pattern. Two of them were detected later during the May to June period (G10097A and C2373T). In North America, there were two identified groups of mutations, each divided into two subgroups. The first group was characterized by mutations with increasing frequencies over time, one subgroup of two mutations (G25563T and C1059T) and another subgroup of four mutations (A23403G, C14408T, C241T, and C3037T). The second group featured one subgroup with a large number of mutations (16 mutations) that appeared later, while the other subgroup featured mutations that emerged in the beginning, and then their frequencies declined during the May to July period. In Oceania, three mutational groups were identified. The first group was marked by the four mutations noticed in the other geographical locations. The second group was characterized by 35 mutations without a clear pattern, and the third group contained 8 mutations that emerged beginning in March (C1059T, G25563T, C18555T, A1163T, T7540C, G16647T, G22992A, and G2340A).
In South America, in addition to the clustering of A23403G, C14408T, C241T, and C3037T, we distinguished two groups of mutations, one of which featured 17 mutations gathered in the same cluster, while the other group contained the variant GGG mutated to AAC at position 28881–28883.
3.4. Nonsynonymous Substitutions and Associated Amino Acid Changes
Our analysis revealed a difference between geographical locations in the number of sites possessing hotspot nonsynonymous mutations with at least once (Figure 3). The distribution of these mutations varied widely among different proteins of SARS-CoV-2. The highest number of mutations was observed in Oceania (), followed by Asia (), North America (), and South America (). The lowest number was observed in Europe and Africa with 8 mutations observed for both regions. About 47% of these mutations were observed only once. Four hotspot substitutions were found in all locations at least once: T265I (NSP2), P314L (NSP12), D614G (S), and Q57H (ORF3a). In Oceania, we have also observed three substitutions (S477N, G485R, and N501Y) in addition to the D614G substitution on the S protein.
The use of advanced sequencing technologies has the potential of producing multiple-time point whole-genome data, which provides insight into the evolution of the SARS-CoV-2 genome during the COVID-19 pandemic in each geographical area, and in identifying a comprehensive list of candidate adaptive mutations for this stage of the pandemic. It is important to note that the viral genome data available are geographically biased in favor of regions performing extensive sequencing. The highest number of samples was observed in Europe, followed by North America, Asia, and then Oceania, while the lowest data were observed in Africa and South America.
Comparison of whole-genome samples with the reference genome revealed a number of mutations occurring mostly at low frequencies. Mutations with very low frequencies are likely due to errors in the NGS procedure rather than true variants in viral strains [14, 15]. This can be explained by the fact that Ti/Tv ratios increase when variants with low frequencies are removed. During six months of monitoring, we found 55 variants that have been detected with a frequency greater than 1%. However, Callaway  reported that a typical SARS-CoV-2 virus accumulates only two single-letter mutations per month in its “genome.”
In practice, Ti/Tv ratios can be used to determine which threshold should be used in QC data . Rayko and Komissarov  have reported a lower Ti/Tv ratio in the unique genome variation of SARS-CoV-2. In the case of a random distribution of Ti and Tv mutations (i.e., without any biological influence), a ratio of 0.5 would be expected, simply due to the fact that there are twice as many Tv mutations possible as Ti. However, in the biological context, a bias of Ti versus Tv is generally observed as a function of unequal base frequencies. In fact, this Ti-Tv substitution bias has been noted in both the eukaryotic and prokaryotic genomes [18, 19]. Using a cutoff filtering frequency of 1% and separately analyzing our data by geographical location and time period, we generally observed a Ti versus Tv bias, which was previously reported by some authors .
Generally, the predominant substitution was C>U transition, which might be intervened by cytosine deaminases . Multiple groups have observed the predominance of C to T (U) substitutions in SARS-CoV-2 [22, 23]. The high frequency of C>U transitions likely reflects the virus adaptation processes in its hosts . Surprisingly, G>U transversions were also frequent. Panchin and Panchin  reported a 9-fold excess of G>U transversions among SARS-CoV-2 mutations over relative substitution frequencies between SARS-CoV-2 and a close relative coronavirus from bats (RaTG13), suggesting that the mutational patterns of SARS-CoV-2 could have changed after transmission to humans.
When comparing the distribution of the proportions of transition and transversion mutations between different geographical locations and time periods, we noted that the bias in transition versus transversion mutation in SARS-CoV-2 is not associated with geographical and temporal effects.
Heatmap analysis (Figure 2) revealed remarkable hotspot mutation patterns for SARS-CoV-2. Mutations were clustered based on how their frequencies changed over time according to each geographic location. We noticed that some clusters show a clear increase or decrease in mutation frequency. However, many mutations appeared in specific periods without a clear pattern over time. Other studies reported that only a handful of clusters were prevalent in the early stage of the pandemic, and even some mutations arose independently. As shown in our study, continuous mutations accumulate in transmitted strains, and the emergence of small clusters was replaced quickly, whereas others gradually became dominant because the mutations were fixed, such as the D614G mutation in the RBD region. It is worth noting that the D614G mutation at position 23403 always accompanies the three most frequent mutation sites in the ORF1ab (14408 in NSP12 and 3037 in NSP3) region, as well as the mutation at position 241. The four mutations occurred probably during the transition between the first cluster reported in Wuhan and the subsequent clusters that spread globally. Moreover, researchers reported that the increase of these mutations was the result of a fitness advantage rather than a genetic drift  of the virus by increasing its infectivity . We assume that the cooccurrence of this group of mutations can functionally cooperate in the stability, transmission, and adaptability of the virus .
Another group of mutations showing a constant increase in frequency was the three adjacent nonsynonymous mutations in the N protein, and the nucleotide sequence GGG changed to AAC, resulting in a change of amino acid from RG to KR (AGGGGA coding for RG changed to AAACGA coding for KR). This group of mutations was identified first in February in Europe and then in other locations.
Two mutations clustered together in North America (G25563T and C1059T) appeared in March with a medium frequency and remained so until June. Both mutations also appeared in all geographical locations in at least one period but with lower frequency than in North America. The C1059T mutation causing amino acid substitutions (T265I) is in gene regions of NSP2 (ORF1ab), whereas the G25563T mutation is on the gene region of ORF3a, which encodes for the largest protein in the SARS-related CoV accessory family proteins. The product of ORF3a is a unique membrane protein and is essential for the pathogenesis of the disease [2, 6].
An interesting mutational pattern was observed mainly in geographical locations performing extensive sequencing. It concerns mutations that appeared early and whose frequencies subsequently declined. In Asia, two mutations cooccurred, the C8782T silent mutation in the NSP4 gene and the T28144C nonsynonymous mutation (L84S) in the ORF8 gene. There is a controversial debate about the ancestry of L and S types of this mutation and its functional impact [28, 29]. During the initial stages of the outbreak in Wuhan, a decrease in the frequency of L-type mutations was observed after January 2020, suggesting that 84S may exhibit some advantages over 84 L. Studies suggest that 84S may lead to structural disorders in the C-terminus of the protein and may also produce a new phosphorylation target for serine/threonine kinases of mammalian hosts.
The same pattern was observed in Europe for two transversion mutations: the G11083T, leading to an L3606F change within the NSP6 protein in ORF1a, and the G26144T, leading to a G251V change in ORF3a. The frequency of the G251V substitution was estimated at 48% in Europe in January and decreased to 13% in March (Figure 3), while the emergence of this substitution is consistent with the lockdown of Wuhan on 23 January 2020 .
Along with the two mutations observed in Asia (C8782T, T28144C), a cluster of three other mutations (C18060T, C17747T, and A17858G) stood out in North America. The complete disappearance of this cluster was observed during the May to June period. The C18060T change is a silent mutation in the ORF1b gene. However, the two other mutations are nonsynonymous, leading to P1427L and Y1464C changes in NSP13 (helicase) which catalyzes the unwinding of duplex oligonucleotides into single strands. According to Guan et al. , the substitution of proline with leucine (P1427L) is not expected to create noticeable effects. Y1464 is part of a region that contributes to the binding and unwinding of duplex oligonucleotides . Its substitution by cysteine would decrease the stability and enhance the dynamics of this particular region and possibly affect the binding and processing of RNA .
Various nonsynonymous mutations were observed during the monitoring period. These can be due to genetic evolution and adaptation including the infectivity and pathogenicity of SARS-CoV-2. However, minimal mutations in the genome, represented by a single D614G mutation in the S protein, have the capacity to alter the traits of a protein, affecting the virus infectivity and clinical outcomes, as well as the epidemiology of the virus [16, 32]. The D614G mutation in the spike (S) protein and P314L in the nonstructural protein 12 (NSP12) are consistently related and common in all geographic locations with increasing frequency over time. The spike region determines the specific binding of SARS-CoV-2 to the host receptor and the initiation of viral replication; this region is reported to be the most potent and essential for viral attachment and entry into the host cells . Eaaswarkhanth et al.  speculate that the S D614G strains could be more virulent, increasing the severity of the virus in infected individuals, especially in Europe where this mutation is prominent . The P314L mutation found within the RNA-dependent polymerase may play a causal role in enhanced viral replication and therefore should be considered for potential contribution to infectivity .
The SARS-CoV-2 virus is continuously evolving and has already formed heterogenic clusters. In addition, several fixed mutations have also been observed in different continents. These mutations can be essential for the adaptation of the virus to the human host. T265I and Q57H substitutions are likely to be associated with pathogenesis. Indeed, NSP2 involves mitochondrial biogenesis and intracellular signaling while ORF3a can induce cell apoptosis [37, 38]. In the spike protein region, three other important nonsynonymous mutations other than D614G have been observed in Oceania (S477N, G485R, and N501Y). This requires further investigation, in particular for possible vaccine application. Amino acid changes may play an important role in increasing the virulence of viral strains by inducing conformational changes in discontinuous neutralizing epitopes .
Some mutations could be the result of the virus adapting to specific environmental conditions in a given geographic area, such as the climate . However, the potential geoclimatic effects on the mutations observed must be evaluated by clinical and/or experimental studies.
4.1. Study Limitations
This study has some limitations, such as the lack of clinical patient data, as well as unbalanced sample sizes between different geographic areas.
In addition, the sequenced viral genomes are not randomly selected from the global population and are therefore susceptible to bias.
In summary, our data provide valuable information on the geographic and temporal genome evolution of SARS-CoV-2. The mutations observed vary according to the geographical distribution and the period of monitoring of the virus. This variation would be influenced by the virus-host interaction. Beside these mutational events resulting in the transformation to a more virulent strain, there are a number of highly conserved regions in the SARS-CoV-2 genome which could be utilized as potential targets for inhibitory drugs and in vaccine development. Effective and timely genome surveillance of viral sequences is worthwhile for effective prevention and control.
Related supplementary data of the current manuscript are available upon request.
Conflicts of Interest
The authors declare that they have no competing interests.
All authors contributed at different levels to analyzing data, preparing references, and writing, revising, and editing the manuscript. The contributions of the authors involved in this study are as follows: Abdelmalek HAKMAOUI: conceptualization, methodology, software, original draft preparation, and data analysis; Faisal KHAN: manuscript cowriting and validating; Abdelhamid LIACINI: manuscript reviewing; Amanjot KAUR: manuscript cowriting, reviewing, and validating; Yacine BERKA: manuscript editing; Safaa MACHRAOUI: reference management; Hafid SOUALHINE: validation of data and manuscript reviewing; Noureddine BERKA: manuscript reviewing and coordination; Hanane RAIS: resources and cosupervision; and Brahim ADMOU: supervision and manuscript reviewing and editing.
The authors warmly thank the laboratories which submitted the genome sequences of SARS-CoV-2 in the GISAID EpiCoV database, sequences which allowed the conduct of this research. A detailed list of contributing labs to GISAID is available in the supplementary file.
Acknowledgment table. (Supplementary Materials)
I. N. Abdullahi, A. U. Emeribe, J. O. Mustapha et al., “Exploring the genetics, ecology of SARS-COV-2 and climatic factors as possible control strategies against COVID-19,” Le Infezioni in Medicina, vol. 28, no. 2, pp. 166–173, 2020.View at: Google Scholar
C. T. Cornillez-Ty, L. Liao, J. R. Yates, P. Kuhn, and M. J. Buchmeier, “Severe acute respiratory syndrome coronavirus nonstructural protein 2 interacts with a host protein complex involved in mitochondrial biogenesis and intracellular signaling,” Journal of Virology, vol. 83, no. 19, pp. 10314–10318, 2009.View at: Publisher Site | Google Scholar