Abstract

Categorical time series are time-sequenced data in which the values at each time point are categories rather than measurements. A categorical time series is considered stationary if the marginal distribution of the data is constant over the time period for which it was gathered and the correlation between successive values is a function only of their distance from each other and not of their position in the series. However, many categorical series do not fit this rather strong definition of stationarity. Such data show various kinds of nonstationary behavior, such as a change in the probability of the occurrence of one or more categories. In this paper, we introduce an algorithm which corrects for nonstationarity in categorical time series. The algorithm produces series which are stationary, although not in the traditional sense often used for categorical time series; the resulting form of stationarity is weaker but still useful for parameter estimation. Simulation results show that this simple algorithm, applied to a DAR(1) model, can dramatically improve the parameter estimates.

1. Introduction

Categorical time series are serially correlated data for which the observation at each time point is recorded as a state (or category). Some such series arise from continuous measurements that one can analyze as categorical, for example, a sequence of rainfall data in which successive days were recorded as “wet” or “dry” [1]. Other series are truly categorical in nature. Examples include geomagnetic reversals of the earth's polarity from “normal” polarity to “reverse” polarity [2] and EEG records of brain waves during a person's sleep, where readings are classified into one of six possible states [3]. Regardless of their origin, it is clear that such series are quite common, although they have received much less attention in the literature than continuous-variable time series.

One quite famous categorical sequence is genomic DNA. Several papers have given attention to finding trends in DNA and testing for differences in trends between coding and noncoding sections of DNA. Two such methods are the wavelet transform modulus maxima (WTMM) [4] and detrended fluctuation analysis (DFA) [5]. However, for both of these methods, the DNA series is converted into a random walk by “walking up” ($x_i = +1$) or “walking down” ($x_i = -1$) depending on whether the $i$th base is a pyrimidine or a purine, respectively [6]. Although discretized, such a series is not categorical in the sense used in this paper; therefore, these methods, although quite useful in their context, are not appropriate for the data discussed in this research.

A categorical time series $\{X_1, X_2, \ldots, X_T\}$ is considered to be stationary if the two $n$-tuples $(X_1, \ldots, X_n)$ and $(X_{1+\ell}, \ldots, X_{n+\ell})$ have the same distribution for every $n \ge 1$ and $\ell \ge 0$ [7, 8]. This definition is too strong for most applications, so one may instead consider the following definition, implied in [9]. This “weak” stationarity requires that $P(X_t = c_j)$ is constant in $t = 1, 2, 3, \ldots$ for every $j = 1, 2, \ldots, C$, where $C$ is the number of possible categories, and that the dependence between two values is a function only of their distance from each other, not of their position in the series. More precisely, $P(X_t = c_j, X_{t+\ell} = c_k) = f_{j,k}(\ell)$, so that $\mathrm{Cov}(X_t, X_{t+\ell}) = f(\ell)$.

Many categorical time series are not stationary in either the strong or the weak sense. As an example, Figure 1(a) shows El Niño data gathered from 1525 to 2010. Data from 1525 to 1987 were provided by [10], and the more recent data by [11]. In this series, 1 indicates the presence of an El Niño event and 0 indicates its absence. There is a distinct change in the probability of El Niño occurrences around time value 290. This change could be due either to better recording of events after this point (time 290 is roughly the year 1815) or to a real change in probability caused by a shift in weather patterns. Since the probability of an El Niño year changes quite abruptly, these data clearly do not fit the definition of stationarity used for categorical time series.

Figure 1(b) shows data indicating the winner of Major League Baseball's All-Star game from 1950 to 2011 [12]. An American League win is coded as 1, and a National League win is coded as 0. The data exhibit clear signs of nonstationarity: the National League dominated until roughly the 1980s or 1990s, and the American League has dominated in the last twenty years or so. Another example (not shown) involves geomagnetic reversals of the earth's polarity from “normal” polarity to “reverse” polarity [2]. In that paper, the authors state that they are unable to use all of the data they have because it is clearly nonstationary. Instead, they choose to use a portion of the data that looks stationary according to a time plot.

For any time series, the presence of a strong trend in the data will often result in biased and imprecise estimation of parameters (and of the model itself). Therefore, it is important to remove such trends when performing model estimation. The focus of this work is examining the effects of nonstationarity on parameter estimation in categorical time series and introducing an algorithm to induce a form of stationarity in nonstationary series. In Section 2, a simple flipping algorithm is introduced, which can be applied to certain nonstationary categorical time series to make them stationary. Simulation results, which show that the correlation parameter estimator from a stationary model can be dramatically improved after applying the algorithm, are given in Section 3. However, the stationarity resulting from the flipping algorithm is not distributional stationarity, but something weaker. We define this form of stationarity and discuss its properties in Section 4. In Section 5, the detrending algorithm is illustrated with data from the sequence of league wins in Major League Baseball's All-Star game from 1950 to 2011.

2. The Flipping Algorithm

In this section, we introduce a simple algorithm which takes a nonstationary series and transforms it into one that is stationary. For simplicity, the initial focus is on series with a binary outcome, where we arbitrarily denote the categories by 0 and 1. The flipping scheme assumes that one of the categories is more common at the beginning of the series and that there is then a transition so that, by the end of the series, the other category is more common. Without loss of generality, the category that is more common at the beginning of the series will be labeled the 0 category. For simplicity, we will examine in detail only the case with one transition ($0 \to 1$), although the algorithm can be extended to multiple transitions.

The algorithm is as follows.

(1) Denote the original, nondetrended series by $X_1, X_2, \ldots, X_T$.
(2) Create $T$ new series, where the $k$th series is created by “flipping” observations $X_1, X_2, \ldots, X_k$. For example, the first series would be the same as the original series, except that the first observation would be changed from 0 to 1 (or 1 to 0). The next series would result from flipping the first two observations, and so forth. The last series would be the complete opposite of the original series.
(3) Count the number of ones in each of the $T + 1$ series (the original series and the $T$ “new” ones).
(4) The series with the highest number of ones is the detrended series. In case of a tie for the highest number of ones, choose the first series in the sequence with the highest number of ones (i.e., the one with the fewest flips).
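A minimal Python sketch of this procedure follows (the function name and the representation of the series as a 0/1 list are ours, not from the paper):

```python
def flip_detrend(series):
    """Flipping algorithm for a binary 0/1 series with one 0 -> 1 transition.

    Considers the original series (k = 0) together with the T series
    obtained by flipping the first k observations (k = 1, ..., T), and
    returns the candidate with the most ones; ties go to the smallest k.
    """
    T = len(series)
    best_k = 0
    best_ones = sum(series)                  # ones in the original series
    for k in range(1, T + 1):
        # Ones after flipping the first k observations: the flipped prefix
        # contributes k minus its original ones; the suffix is unchanged.
        ones = (k - sum(series[:k])) + sum(series[k:])
        if ones > best_ones:                 # strict ">" keeps smallest k on ties
            best_k, best_ones = k, ones
    return [1 - x for x in series[:best_k]] + list(series[best_k:])

# Applied to the example sequence discussed next, this returns the k = 4
# candidate [1, 0, 1, 1, 1, 0, 1, 1], which has six ones.
print(flip_detrend([0, 1, 0, 0, 1, 0, 1, 1]))
```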

As a simple example, consider the sequence 0, 1, 0, 0, 1, 0, 1, 1. There are nine sequences to consider: the original one and the eight new ones generated by flipping as above. The sequences are given in Table 1. The first row of the table gives the original sequence, and the next eight rows give the generated sequences obtained from the original. There are two sequences with the maximum number of 1s, the $k = 4$ and $k = 6$ sequences, obtained by flipping the first four and the first six observations, respectively. Both of these sequences have six ones. By convention, the tie is broken by using the earlier sequence ($k = 4$), as it is the “least disturbed” relative to the original.

In general, one would apply this algorithm to sequences which exhibit a trend in the number of ones, that is, sequences for which there exists a point $k$ such that 0s are more common than 1s for $t < k$ and 1s are more common than 0s for $t > k$. By design, the algorithm then produces a sequence in which 1s are more common than 0s both for $t < k$ and for $t > k$, and hence one can say the trend has been removed.

We can also extend the algorithm to series with more than two categories in the following manner. Suppose there are three categories, 1, 2, and 3. In this situation, without loss of generality, one can label the category that is most likely early in the sequence as category 1 and the category that is most likely at the end of the sequence as category 3. We assume there are either two transitions in terms of the most likely category ($1 \to 2 \to 3$) or one transition ($1 \to 3$). It would also be possible to extend the idea to more than two transitions using a scheme similar to the one described below.

To transform the series to a stationary sequence, we create a new sequence such that category 3 is most likely everywhere. To do this, we define two cut points $k_1$ and $k_2$ with $k_1 \le k_2$, each chosen from $\{0, 1, \ldots, n\}$, where $n$ is the length of the sequence. There are then $\binom{n+2}{2} = (n+1)(n+2)/2$ choices for $(k_1, k_2)$. For each pair of cut points $(k_1, k_2)$, create a new sequence as follows: if $t \le k_1$, flip categories 1 and 3 but leave 2 unchanged; if $k_1 < t \le k_2$, flip categories 2 and 3 but leave 1 unchanged; and if $t > k_2$, leave the categories unchanged. Now, for each sequence, count the number of category 3 responses, and choose as the detrended sequence the one with the highest number of 3s. To break any tie, choose the sequence with the smallest $k_2$ and then the smallest $k_1$. Extensions to more than three categories are possible, but the algorithm becomes much more computationally demanding.
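A sketch of this three-category version, under the same labeling conventions (the function name is ours):

```python
def flip_detrend_3cat(series):
    """Three-category flipping algorithm (categories coded 1, 2, 3).

    For each cut-point pair (k1, k2) with 0 <= k1 <= k2 <= n: swap
    categories 1 and 3 for t <= k1, swap 2 and 3 for k1 < t <= k2, and
    leave t > k2 unchanged. Return the candidate with the most 3s,
    breaking ties by smallest k2, then smallest k1.
    """
    n = len(series)
    swap13 = {1: 3, 2: 2, 3: 1}
    swap23 = {1: 1, 2: 3, 3: 2}
    best, best_threes = list(series), sum(x == 3 for x in series)
    for k2 in range(n + 1):            # ascending k2, then ascending k1,
        for k1 in range(k2 + 1):       # enforces the tie-breaking rule
            candidate = ([swap13[x] for x in series[:k1]]
                         + [swap23[x] for x in series[k1:k2]]
                         + list(series[k2:]))
            threes = sum(x == 3 for x in candidate)
            if threes > best_threes:   # strict ">" keeps the earliest pair
                best, best_threes = candidate, threes
    return best
```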

3. Nonstationary Series and the Effect of the Flipping Algorithm

Simulation results given in this section show the effects of nonstationarity and the flipping algorithm on the fit of the Discrete Autoregressive Model (DAR) of Jacobs and Lewis [7, 8]. The DAR model is used as an illustration of a simple stationary model, without implying that this model is necessarily the best one to use for categorical data in general. The primary motivation here is to show that nonstationarity can seriously compromise the fit, but detrending, while not a panacea, can help. First, we describe the DAR model, then give the simulation results.

The sequence $\{X_t\}$ has a binary DAR(1) structure when it is formed according to the probabilistic linear model

$$X_t = I_t X_{t-1} + (1 - I_t) Y_t, \qquad t = 2, 3, \ldots, \tag{3.1}$$

where $\{Y_t\}$ is a sequence of independent binary random variables with $P(Y_t = 1) = p_t$, and $\{I_t\}$ is a sequence of independent binary random variables, independent of $\{Y_t\}$, with $P(I_t = 1) = q$ for fixed $0 \le q \le 1$. Typically one assumes that $X_1 = Y_1$. Note that $X_t$ is also binary, and it is a simple matter to show that if $p_t = p$ for all $t$, then $P(X_t = 1) = p$, and hence $\{X_t\}$ is a stationary series.
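The following is a minimal simulation sketch of model (3.1) in Python with NumPy; allowing the innovation probability to vary with $t$ covers both the stationary and nonstationary cases used below (function and argument names are ours):

```python
import numpy as np

def simulate_dar1(p, q, rng=None):
    """Simulate a binary DAR(1) series X_1, ..., X_T from model (3.1).

    p : sequence of innovation probabilities p_t = P(Y_t = 1);
        a constant sequence gives the stationary case.
    q : P(I_t = 1), the probability that X_t copies X_{t-1}.
    """
    rng = np.random.default_rng() if rng is None else rng
    p = np.asarray(p, dtype=float)
    T = len(p)
    y = (rng.random(T) < p).astype(int)   # innovations Y_t
    i = rng.random(T) < q                 # copy indicators I_t
    x = np.empty(T, dtype=int)
    x[0] = y[0]                           # X_1 = Y_1
    for t in range(1, T):
        x[t] = x[t - 1] if i[t] else y[t]
    return x
```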

Figure 2 shows data simulated from a DAR(1) model with three different values of $q$. Note that the value of $q$ controls the probability of $X_t$ staying in the same state (0 or 1). As the figures show, if $q$ is very large, it is very likely that $X_t = X_{t-1}$, and long runs dominate the series.

Our primary interest here is in the estimation of $q$, as that is the most useful parameter in one-step forecasting, and in the nonstationary cases $p$ has no clear meaning. Indeed, it is easy to show that under the stationary DAR model, $\mathrm{Corr}(X_t, X_{t-1}) = q$. A simple method of moments estimator of $q$ is therefore the lag-one sample correlation between successive observations.
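A sketch of this estimator (the function name is ours):

```python
import numpy as np

def estimate_q(x):
    """Method of moments estimate of q: the sample lag-one correlation."""
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[:-1], x[1:])[0, 1]
```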

The data for the simulations consist of series of length 200 generated from a DAR(1) model with one transition at a chosen point $k$, where the probability of a 1 changes from $p$ to $1 - p$. The value of $q$, the correlation parameter, was one of $q = 0.1, 0.3, 0.5, 0.7, 0.9$. The value of $p$ was chosen to be 0.1, 0.3, or 0.5. Values not exceeding 0.5 were chosen because of the way the detrending algorithm is defined: it is assumed that the 0 category is the most likely category before the transition point and that the most likely category transitions from 0 to 1. The transition point was chosen as $k = 50$, $100$, or $150$. This simulation model is the most favorable one for the flipping algorithm; however, simulations from other models, described in Section 4, also show good performance for the algorithm.

For each combination of $p$, $q$, and $k$, one thousand sequences of length two hundred were generated, and the method of moments estimate of $q$ under the DAR(1) model was calculated. Mean squared errors (MSEs) of the estimates of $q$ for each of these forty-five parameter combinations were then calculated. Table 2 gives these MSEs, multiplied by 1000. Only the results for $k = 100$ are shown because the results for $k = 50$ and $k = 150$ are similar.
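A sketch of one cell of this experiment, reusing the simulate_dar1, flip_detrend, and estimate_q sketches above (the specific parameter values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, k, T, n_rep = 0.1, 0.3, 100, 200, 1000
# P(Y_t = 1) jumps from p to 1 - p at the transition point k.
p_t = np.where(np.arange(T) < k, p, 1 - p)

raw, flipped = [], []
for _ in range(n_rep):
    x = simulate_dar1(p_t, q, rng)
    raw.append(estimate_q(x))                   # estimate from the raw series
    flipped.append(estimate_q(flip_detrend(x))) # estimate after flipping

mse_raw = 1000 * np.mean((np.array(raw) - q) ** 2)        # scaled as in Table 2
mse_flipped = 1000 * np.mean((np.array(flipped) - q) ** 2)
```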

The simulation error, on the scale given in the table, is around 1 for the smallest MSEs and 5 for the largest. The MSEs for the estimation of $q$ from the raw (not detrended) data are given in the top row for each value of $p$. The MSEs in boldface type are those from the detrended series. Maximum likelihood estimates were also calculated, but the results are not shown because they are very similar to the ones given here.

From the table, one can make several observations. First, nonstationarity has a serious effect on the estimate of $q$, particularly when $q$ is small, which would be more typical for real data. Second, flipping the nonstationary series produces much better estimates of $q$. Further examination of the estimates shows that bias is a serious problem under nonstationarity. For example, if $p = q = 0.1$ and $k = 100$, the average value of $\hat{q}$ was 0.60. With flipping, the average value fell to 0.09, an almost total elimination of the bias. (For a less extreme example, when $p = 0.1$, $q = 0.7$, and $k = 100$, the average value of $\hat{q}$ without detrending was 0.88; with flipping, it was 0.63.) This illustrates the potential value of the algorithm.

4. Weakly Stationary Categorical Time Series

The detrending algorithm produces a series which is stationary, but not in the strong sense described in the introduction. The output of the detrending algorithm is a series for which the identity of the most common category is the same over time, although its probability may change. We will term this type of stationarity categorical, or modal, stationarity, denoted by C(1). In general, one could have a series for which the identity of the $J \ge 1$ most likely categories remains the same, with all the others possibly changing; this would be C($J$) stationarity.
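One way to state C(1) stationarity formally (this formalization is ours, inferred from the description above):

```latex
% C(1) (categorical/modal) stationarity: the modal category is
% time-invariant, although its probability may change with t:
\[
  \operatorname*{arg\,max}_{j \in \{1, \ldots, C\}} P(X_t = c_j) = j^{*}
  \quad \text{for all } t,
\]
% while P(X_t = c_{j^*}) itself is free to vary; C(J) requires the set of
% the J most likely categories to be time-invariant.
```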

For completeness, we simulated 1000 C(1) stationary categorical time series of length 200 and estimated $q$ without applying the detrending algorithm. We also did the same for a nonstationary series in which $p_t = P(Y_t = 1)$ changes linearly with time, that is, $p_t = \beta_0 + \beta_1 t$. By choosing different values of $\beta_0$ and $\beta_1$, one can control how rapidly $p_t$ changes with $t$ and what range of values is observed across the sequence. Further, we simulated strongly stationary series (which we term distributionally stationary, denoted by D(1)) and obtained estimates of $q$ without applying the detrending algorithm.
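A sketch of the linear-drift case, reusing simulate_dar1 from above (the beta values are illustrative, not those used in the simulations):

```python
import numpy as np

T = 200
beta0, beta1 = 0.1, 0.8 / (T - 1)   # drifts p_t linearly from 0.1 up to 0.9
p_t = np.clip(beta0 + beta1 * np.arange(T), 0.0, 1.0)
x = simulate_dar1(p_t, q=0.3, rng=np.random.default_rng(1))
```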

The results are given in Table 3. Clearly, trying to estimate $q$ from a nonstationary series results in poor estimates. Distributionally stationary series produce good estimates of $q$, as the DAR model itself is distributionally stationary. Estimates of $q$ from categorically stationary series are better than those from nonstationary series. Detrending a nonstationary series to C(1) using the flipping algorithm results in better estimates of $q$ than does starting with a pure C(1) stationary series, except for large values of $q$.

It is interesting that, even though the estimation procedure for $q$ assumes strong stationarity, parameter estimates from a weakly stationary series are a great improvement over estimates from series that are completely nonstationary. In turn, series resulting from the flipping algorithm give estimates that are almost as good as those from D(1) stationary series.

5. Detrending the All-Star Data

Since 1933, Major League Baseball has played an All-Star game, in which the best and most popular players from the American and National Leagues play against each other. The All-Star game was not played in 1945, and two games were played each year between 1959 and 1962. The 2002 game was omitted because it ended in a tie. For the years in which two games were played, the winner for that year was taken to be the league that scored the most combined runs against the other. A plot of the data was given in Figure 1(b), and a decadal summary of the data is displayed in Table 4. We focus on these data largely because they provide a simple illustration of a series with only one transition.

The series was detrended using the scheme outlined in Section 2. The detrended series is the opposite of the original series until the year 1985; in other words, the detrending algorithm picked 1985 as the year that superiority switched from the National League to the American League.

Assuming a stationary DAR model, $q$ was estimated by the sample lag-one correlation for both the original and the detrended series. The estimate of $q$ is 0.28 for the original series and 0.17 for the detrended series. This is a substantial reduction and shows that detrending can have a large impact on estimates of $q$.
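For illustration, assuming the win sequence is stored as a hypothetical 0/1 array allstar (1 = American League win, ordered by year), the two estimates could be computed with the earlier sketches:

```python
# allstar: hypothetical 0/1 array of the game results [12]; the actual
# values are not reproduced here.
q_raw = estimate_q(allstar)                      # original series
q_detrended = estimate_q(flip_detrend(allstar))  # series flipped up to 1985
```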

6. Discussion

In this paper, we introduce an algorithm for inducing a type of stationarity in categorical time series data. The algorithm does not really detrend the series in the traditional sense, because the kinds of examples we consider have an abrupt change in the probability that a particular category occurs, rather than a trend in the probability of the occurrence of a category. The algorithm uses mild assumptions on the data and produces a series which is stationary, but not in the strong sense. We term this weaker form of stationarity “categorical stationarity.”

Using a simple strongly stationary model, it was shown via simulation that fitting a strongly stationary model to nonstationary data can result in poor estimates of the correlation parameter, but that the estimates can be dramatically improved by first applying the flipping algorithm. Additional simulations show that fitting a model which assumes strong stationarity to a series which is categorically stationary (without flipping) gives estimates that are less biased than those obtained from nonstationary data.

It is clear that the most widely accepted definition of stationarity for categorical time series is too strong for some real data. Such data typically have change points where the probability of the most likely category changes (or the most likely category itself changes). However, many models for categorical time series, such as the DAR and DARMA models, and other methods, such as spectral estimation [9, 13], were developed under the assumption that the series is strongly stationary. Additional work is needed to ascertain whether methods intended for strongly stationary series remain valid for categorically stationary series. Further extensions and improvements to the flipping algorithm are also left for future work.