Computational and Mathematical Methods in Medicine

Volume 2018, Article ID 6490647, 9 pages

https://doi.org/10.1155/2018/6490647

## Euclidean Distance Analysis Enables Nucleotide Skew Analysis in Viral Genomes

^{1}Laboratory of Experimental Virology, Medical Microbiology, Amsterdam UMC, University of Amsterdam, Amsterdam, Netherlands

^{2}Research Institute of Child Development and Education, University of Amsterdam, Amsterdam, Netherlands

^{3}Medical Microbiology, Amsterdam UMC, University of Amsterdam, Meibergdreef 9, 1105 AZ Amsterdam, Netherlands

Correspondence should be addressed to Formijn van Hemert; f.j.vanhemert@amc.uva.nl

Received 29 May 2018; Revised 27 September 2018; Accepted 8 October 2018; Published 30 October 2018

Academic Editor: Emil Alexov

Copyright © 2018 Formijn van Hemert et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Nucleotide skew analysis is a versatile method to study the nucleotide composition of RNA/DNA molecules, in particular to reveal characteristic sequence signatures. For instance, skew analysis of several viral RNA genomes indicated that the nucleotide bias is enriched in the unpaired, single-stranded genome regions, thus creating an even more striking virus-specific signature. However, the comparison of skew graphs for many virus isolates or families is difficult, time-consuming, and nonquantitative. Here, we present a procedure for a simpler identification of similarities and dissimilarities between nucleotide skew data of coronavirus, flavivirus, picornavirus, and HIV-1 RNA genomes. Window and step sizes were normalized to correct for differences in the length of the viral genomes. Cumulative skew data are converted into pairwise Euclidean distance matrices, which can be presented as neighbor-joining trees. We present skew value trees for the four virus families and show that closely related viruses are placed in small clusters. Importantly, the skew value trees are similar to the trees constructed by a “classical” model of evolutionary nucleotide substitution. Thus, we conclude that the simple calculation of Euclidean distances between nucleotide skew data allows an easy and quantitative comparison of characteristic sequence signatures of virus genomes. These results indicate that the Euclidean distance analysis of nucleotide skew data is a useful addition to the virology toolbox.

#### 1. Introduction

Nucleotide skew analysis [1] provides a powerful tool to visualize compositional aspects of a DNA/RNA sequence. For instance, the minimum and maximum of a G vs C skew can be used to predict the origin of replication and the location of the terminus, respectively, in prokaryotic genomes [2–4]. The GenSkew algorithm [1] calculates ratio values of the six nucleotide combinations (C vs G, G vs A, U vs G, U vs A, C vs A, and U vs C) in predefined windows and steps along a sequence. Ratio values are calculated according to (N1 − N2)/(N1 + N2), and hence, a positive value indicates that N1 outnumbers N2. Skew graphs are generally created by plotting the consecutive window numbers on the *X*-axis against the corresponding cumulative skew values on the *Y*-axis. In this way, we demonstrated for a representative collection of RNA viruses that the skew plots can be interpreted as “nucleotide compositional signatures” of the viral genomes and that these characteristic signatures are more prominent in the single-stranded regions than in the base-paired, double-stranded regions of a viral RNA genome [5]. Likewise, we demonstrated that purine enrichment in the Zika virus RNA genome [6] is a general property of most but not all Flaviviridae and, surprisingly, is prominently observed at the first position of the codons and not at the silent 3^{rd} codon position (unpublished results). It is, however, difficult, time-consuming, and nonquantitative to compare different skew graphs with respect to similarities and dissimilarities. We therefore developed a simple mathematical addition to GenSkew analysis that converts skew data into a pairwise Euclidean distance matrix, which can be clustered into a neighbor-joining tree, facilitating the identification of putative relationships, e.g., between viral sequences.

The key result of this study is that this Euclidean distance approach offers an easy and quantitative interpretation of nucleotide skew data of virus genomes. The construction of Euclidean distance trees based on skewed nucleotide compositions does not require a prior alignment of the sequences. In contrast, “classical” phylogenetic trees are modelled on an accurate nucleotide alignment of the sequences. As an additional result, we demonstrate that skew distance trees and phylogenetic trees are surprisingly similar but not identical.

#### 2. Materials and Methods

Nucleotide sequences of the single-stranded (ss) RNA genomes of coronavirus, picornavirus, and HIV (reference strains and unclassified species) were downloaded from the ViralZone database (http://viralzone.expasy.org/) [7, 8]. Flaviviridae were selected and classified according to [9, 10] (Berkhout and van Hemert, unpublished results). GenBank IDs are provided in the figures. Nucleotide skew analysis was done by means of the GenSkew algorithm [1], of which Dr. T. Rattei (Technische Universität München) kindly provided a version that is not restricted by the length of a sequence. For normalization purposes, the size of the overlapping windows was set to 1% of the sequence length, with a step size of 20% of the window size. The skew between two nucleotides N1 and N2 is defined by the ratio (N1 − N2)/(N1 + N2), and hence, a positive value of this ratio indicates that N1 proportionally exceeds N2. If the N1 versus N2 comparison results in a negative skew value, the reverse comparison (N2 versus N1) yields a skew value of the same magnitude but with a positive sign. Algorithms converting skew data into a pairwise Euclidean distance matrix are provided as Additional File S1. The multiple sequence alignment of coronaviral, flaviviral, picornaviral, and HIV genomes was obtained by means of MAFFT [11]. Other alignments and Neighbor Joining (NJ) skew distance trees were built in MEGA v7 [12]. Phylogenetic histories were inferred by using the maximum-likelihood method based on the general time reversible model of nucleotide substitution in the viral genomes [12]. A discrete gamma distribution (5 categories) was used to model evolutionary rate differences among sites. The tree with the highest log likelihood is shown. Randomization of the rubella virus RNA genome (10 consecutive cycles to ensure complete nucleotide randomization) was performed by means of the BioWeb server (http://www.cellbiol.com/scripts/randomizer/dna_protein_sequence_randomizer.php). All calculations were performed in Excel.
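The windowed skew calculation described above can be illustrated with a minimal Python sketch. It is not the GenSkew implementation and not the Excel workflow of Additional File S1: the function names `window_skew` and `cumulative_skews` are hypothetical, and the treatment of the sequence end (only full-length windows are used) and of windows lacking both nucleotides (skew set to 0) are assumptions, so the number of windows may differ slightly from what GenSkew reports.

```python
from itertools import accumulate

# The six nucleotide comparisons used in the text (N1 vs N2).
PAIRS = [("C", "G"), ("G", "A"), ("U", "G"), ("U", "A"), ("C", "A"), ("U", "C")]


def window_skew(window, n1, n2):
    """Skew (N1 - N2)/(N1 + N2) for one window; 0.0 if neither occurs (assumption)."""
    a, b = window.count(n1), window.count(n2)
    return 0.0 if a + b == 0 else (a - b) / (a + b)


def cumulative_skews(seq, window_frac=0.01, step_frac=0.20):
    """Cumulative skew per nucleotide pair, with window = 1% of the sequence
    length and step = 20% of the window size, as stated in the text."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    window = max(1, round(len(seq) * window_frac))
    step = max(1, round(window * step_frac))
    # Assumption: only windows that fit completely within the sequence are used.
    starts = range(0, len(seq) - window + 1, step)
    result = {}
    for n1, n2 in PAIRS:
        per_window = [window_skew(seq[s:s + window], n1, n2) for s in starts]
        result[f"{n1} vs {n2}"] = list(accumulate(per_window))
    return result
```

Calling `cumulative_skews(genome)["C vs G"]` returns the list of *Y*-values of the C vs G skew line, and the last element of each list is the skew endpoint value for that pair. For a genome of 9761 nucleotides, the rounding used here gives a window of 98 and a step of 20.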

#### 3. Results

##### 3.1. Nucleotide Skew Analysis of a Single Sequence

We used the rubella virus with its G- and C-rich RNA genome (JN635295) to illustrate the skew plot analysis (Figure 1). Skew profiles are shown of the nucleotide sequence before (Figure 1(a)) and after (Figure 1(b)) 10 consecutive cycles of randomization. The profiles are nearly identical because skew profiles are determined solely by nucleotide composition and not by the nucleotide sequence. The rubella virus RNA genome size is 9761 nucleotides, and hence, the window size and step size are set to 98 (1% of sequence length) and 20 (20% of window size), respectively, generating 489 overlapping windows from the 5′- to the 3′-end (*X*-axis) with the corresponding skew values cumulatively plotted on the *Y*-axis. The skew lines start by definition at the origin of the plot. It should be noted that, in skew notation, CG does not represent a CG base pair but a comparison of the proportions of the C and G nucleotides. We therefore adopted the notation C versus G (C vs G) in the text and the figures. The nucleotides C and G outnumber A (steep positive slope), and U is outnumbered by C and G (steep negative slope) by approximately the same proportion. C is slightly more prominent than G, and the proportions of U and A are close to equality. Importantly, the skew profiles are straight lines, which indicates that the virus-specific nucleotide bias is quite constant along the sequence of the RNA genome. Therefore, the skew values at the 3′ end of the genome (window position 489) represent a reliable measure of the pairwise nucleotide compositional bias of the rubella virus genome. For instance, the skew endpoint values of C vs A (216.24) and U vs C (−213.37) are very similar in magnitude but opposite in sign. The same is true for G vs A (169.79) and U vs G (−168.02), and the skew endpoint values for C vs G and U vs A are 56.39 and 2.31, respectively.

For the rubella virus genome, these values can be considered a characteristic signature of the nucleotide composition and presented as a vector in a six-dimensional Euclidean space: (C vs G, G vs A, U vs G, U vs A, C vs A, U vs C) = (56.39, 169.79, −168.02, 2.31, 216.24, −213.37).
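To make the six-dimensional representation concrete, the following Python sketch computes Euclidean distances between such endpoint vectors and collects them in a pairwise matrix. It assumes that the six skew endpoint values form the vector, as described above, and it does not reproduce the authors' algorithms from Additional File S1. The rubella values are those quoted in the text; the `UNBIASED` vector and the function names are hypothetical and serve only as an illustrative reference point.

```python
import math

# Order: C vs G, G vs A, U vs G, U vs A, C vs A, U vs C (values from the text).
RUBELLA = (56.39, 169.79, -168.02, 2.31, 216.24, -213.37)

# Hypothetical reference: a genome without any pairwise nucleotide bias.
UNBIASED = (0.0, 0.0, 0.0, 0.0, 0.0, 0.0)


def euclidean(u, v):
    """Euclidean distance between two skew vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(vectors):
    """Pairwise distance matrix as a nested dict keyed by genome name."""
    names = list(vectors)
    return {a: {b: euclidean(vectors[a], vectors[b]) for b in names} for a in names}


matrix = distance_matrix({"rubella": RUBELLA, "unbiased": UNBIASED})
print(matrix["rubella"]["unbiased"])  # distance of the rubella signature from zero bias
```

A symmetric matrix of this kind, with zeros on the diagonal, is the type of input from which a neighbor-joining tree can be built; the article reports building its skew distance trees in MEGA v7.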