#### Abstract

The predominant learning algorithms for Hidden Markov Models (HMMs) are local search heuristics, of which the Baum-Welch (BW) algorithm is the most widely used. It is an iterative learning procedure that starts from a predefined number of states and randomly chosen initial parameters. However, poorly chosen initial parameters may cause the algorithm to become trapped in a local optimum and to converge slowly. To overcome these drawbacks, we propose a more suitable model initialization approach, a Segmentation-Clustering and Transient analysis (SCT) framework, which estimates the number of states and the model parameters directly from the input data. Based on an analysis of the information flow through HMMs, we demystify the structure of models and show that high-impact states are directly identifiable from the properties of observation sequences. States having a high impact on the log-likelihood make HMMs highly specific. Experimental results show that, although the identification accuracy drops to 87.9% when random models are considered, the SCT method is around 50 to 260 times faster than the BW algorithm, with 100% correct identification for highly specific models whose specificity is greater than 0.06.

#### 1. Introduction

Hidden Markov Models (HMMs) [1] are one of the statistical modelling tools showing great success and have been widely used in diverse application fields such as speech processing [2], machine maintenance [3], acoustics [4], biosciences [5], handwriting and text recognition [6], and image processing [7]. Despite the merit of simplicity and learning capabilities, HMMs are still facing open problems such as learning effectiveness and efficiency.

There are two major problems in HMM learning: (1) choosing the model size (number of hidden states) and (2) estimating the model parameters. Regarding the first problem, state-of-the-art approaches normally train multiple HMMs with different numbers of states and select the best one using specific criteria (e.g., the Akaike information criterion (AIC) [8] or the Bayesian Information Criterion (BIC) [9]). To tackle the second problem, traditional learning algorithms such as the Baum-Welch (BW) algorithm iteratively optimize the model parameters, starting from an initial set of parameters that is most often chosen at random. Such iterative optimization heuristics are prone to local optima. Therefore, multiple runs (typically 10 [10, 11] or 20 [12, 13]) with different initializations are performed and the best result is chosen. However, such iterative approaches with multiple trainings are time-inefficient and computationally expensive. Hsu et al. [14] introduced a noniterative method that employs a spectral algorithm for learning HMMs. It is simple and requires only a singular value decomposition and matrix multiplications. Nonetheless, it is evaluated in [15] and shown to be applicable only when relatively few observations are available and to fail completely when the number of available observations is large. Fox et al. [8] proposed a sticky HDP-HMM, a nonparametric, infinite-state model that automatically and robustly learns the size of the state space and the smoothly varying dynamics. However, this approach is computationally prohibitive when datasets are very large [9]. Therefore, in spite of their limitations, classical iterative approaches are still widely used to estimate the model size and model parameters, for lack of alternatives.

The aim of this paper is to improve the effectiveness and efficiency in model learning compared to the conventional BW algorithm, in the sense of accurately and quickly finding the correct model. One of the HMM assumptions is that the observed data is only dependent on the hidden states given the model. Therefore, the observed data often reflects the structure and statistical properties of the model, which motivates us to introduce a data-driven preestimation procedure to estimate the number of states and choose proper initial model parameters.

We first provide insight into the essential features of an HMM that help to improve the model's expressiveness as a stochastic process [16]. This is done by inspecting the role of each hidden state in generating observation distributions and in providing information on the model structure. Hidden states with a large influence on observation sequences increase the value of a model more than those with little or no influence. By analysing how the information flows through the HMM, we determine which cases make a state have a high impact. As discussed in Section 3, persistent and/or transient-cyclic states appear to be high-impact states. Moreover, a model with high-impact states is highly *specific* and will be easy to identify. We introduce the term *specificity* as the minimum model distance between a model and the best HMM with one state less. In contrast, some HMMs are in principle unidentifiable, which has been proved in [17] by linking the learning of HMMs to the nonlearnability results of finite automata. Furthermore, there are models in between the learnable and the unlearnable HMMs, which are hard to learn from observation sequences. Such HMMs contain complex parameter configurations with low specificity and low-impact states. Overall, experimental results show that a better number of states and a proper initialization learned by the proposed method increase the learning speed and accuracy for highly specific HMMs compared to the traditional Baum-Welch algorithm.

The remainder of the paper is organized as follows. In Section 2, the preliminaries on HMMs and the Baum-Welch learning problem are briefly reviewed, followed by the concepts and definitions of model characteristics such as model identifiability, model equivalence, and the minimality of models. In Section 3, the impact of states on model specificity is studied through an information flow analysis. The approximate identification framework is presented in Section 4, and experiments and results are discussed in Section 5. Finally, conclusions are given in Section 6.

#### 2. Preliminaries

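An HMM [1] is a doubly stochastic process in which the underlying process is a Markov chain that is unobservable (hidden) but can be observed through another stochastic process that emits the sequence of observations. Let $N$ denote the number of states and $M$ the number of observation symbols. Let $S = \{s_1, \ldots, s_N\}$ and $V = \{v_1, \ldots, v_M\}$ denote the set of states and the set of observations, respectively. Using $q_t$ and $o_t$ to represent the state and the emitted observation at time $t$, respectively, the state and observation sequences are denoted by the vectors $Q = (q_1, \ldots, q_T)$ and $O = (o_1, \ldots, o_T)$, where $q_t \in S$, $o_t \in V$, and $T$ is the number of states or observations in the sequence. A discrete-time HMM can be characterized by the quintuple $\lambda = (N, M, \pi, A, B)$ [1]: the initial state probability distribution is a column vector $\pi = [\pi_1, \ldots, \pi_N]^{\top}$, where the $i$th element is

$$\pi_i = P(q_1 = s_i), \quad 1 \le i \le N; \tag{1}$$

the state transition probability distribution matrix is $A \in \mathbb{R}^{N \times N}$, where the $(i,j)$th element is

$$a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i), \quad 1 \le i, j \le N; \tag{2}$$

and the observation probability distribution matrix is $B \in \mathbb{R}^{N \times M}$, where the $(j,k)$th element is

$$b_j(k) = P(o_t = v_k \mid q_t = s_j), \quad 1 \le j \le N,\ 1 \le k \le M. \tag{3}$$

Note that the state transition probabilities of state $s_i$ include both *incoming* and *outgoing* probabilities. The *incoming* state transition probabilities of $s_i$ are the $i$th column vector of $A$, denoted as

$$a_{\cdot i} = [a_{1i}, a_{2i}, \ldots, a_{Ni}]^{\top}, \tag{4}$$

and the *outgoing* state transition probabilities of $s_i$ are the $i$th row vector of $A$, denoted as

$$a_{i \cdot} = [a_{i1}, a_{i2}, \ldots, a_{iN}], \tag{5}$$

where $a_{\cdot i}, a_{i \cdot} \in \mathbb{R}_{\ge 0}^{N}$ and $\mathbb{R}_{\ge 0}$ represents the set of nonnegative real numbers.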
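As a minimal sketch of the notation above, the following Python snippet stores the parameters of a discrete HMM and exposes the incoming (column) and outgoing (row) transition probabilities of a state. The class name `HMM`, the helper `is_row_stochastic`, and the numeric values are illustrative assumptions, not part of the original paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HMM:
    pi: List[float]       # pi[i]   = P(q_1 = s_i), length N
    A: List[List[float]]  # A[i][j] = P(q_{t+1} = s_j | q_t = s_i), N x N
    B: List[List[float]]  # B[j][k] = P(o_t = v_k | q_t = s_j), N x M

    def outgoing(self, i: int) -> List[float]:
        """Row i of A: probabilities of leaving state s_i."""
        return list(self.A[i])

    def incoming(self, i: int) -> List[float]:
        """Column i of A: probabilities of entering state s_i."""
        return [row[i] for row in self.A]


def is_row_stochastic(rows: List[List[float]], tol: float = 1e-9) -> bool:
    """Check that every row is nonnegative and sums to one."""
    return all(min(r) >= 0.0 and abs(sum(r) - 1.0) <= tol for r in rows)


# Hypothetical two-state, three-symbol model used only for illustration.
hmm = HMM(
    pi=[0.5, 0.5],
    A=[[0.9, 0.1], [0.2, 0.8]],
    B=[[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]],
)
```

Here `hmm.incoming(0)` collects the first column of `A`, while `hmm.outgoing(0)` returns its first row; only the rows of `A` and `B` are required to sum to one.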

##### 2.1. The Baum-Welch Learning Algorithm

One of the three basic problems for HMMs is the learning problem [1], which is often solved by an Expectation-Maximization (EM) algorithm [18] known as the Baum-Welch algorithm [19, 20]. Starting with a random initial guess of the model, the model parameters are iteratively reestimated as long as the new model has an increased likelihood compared to the previous one; that is, $P(O \mid \bar{\lambda}) > P(O \mid \lambda)$, where $P(O \mid \lambda)$ and $P(O \mid \bar{\lambda})$ represent the likelihood values of an observation sequence $O$ under the previous model $\lambda$ and the newly updated model $\bar{\lambda}$, respectively. This procedure continues until the likelihood converges to a stationary point. However, the BW algorithm suffers from the problem of getting stuck at a local optimum if the initial model parameters are not well chosen, which motivates this study to search for a better estimate of the initial parameters.

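For the analysis, we need to calculate the likelihood of observations given the model, that is, $P(O \mid \lambda)$. It can be written by the use of projection operations; see, for instance, [16, p. 18]. Let $\boldsymbol{\alpha}_t = [\alpha_t(1), \ldots, \alpha_t(N)]^{\top}$ and $\alpha_t(i) = P(o_1, \ldots, o_t, q_t = s_i \mid \lambda)$, where

$$\boldsymbol{\alpha}_1^{\top} = \pi^{\top} B_{o_1}, \qquad \boldsymbol{\alpha}_{t+1}^{\top} = \boldsymbol{\alpha}_t^{\top} A B_{o_{t+1}};$$

thus

$$\boldsymbol{\alpha}_T^{\top} = \pi^{\top} B_{o_1} A B_{o_2} \cdots A B_{o_T},$$

where $1 \le t < T$ and $B_{o_t} = \operatorname{diag}(b_1(o_t), \ldots, b_N(o_t))$, which denotes the diagonal matrix whose diagonal elements are the $o_t$th column of $B$.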

Therefore, the *likelihood* of the observations given the model can be expressed as

$$P(O \mid \lambda) = \pi^{\top} B_{o_1} A B_{o_2} \cdots A B_{o_T} \mathbf{1},$$

where $B_{o_t} = \operatorname{diag}(b_1(o_t), \ldots, b_N(o_t))$ is the diagonal matrix defined above and $\mathbf{1}$ is a column vector of length $N$ with all entries equal to 1; that is, $\mathbf{1} = [1, \ldots, 1]^{\top}$. For the convenience of calculations, the logarithm of the likelihood, the *log-likelihood* (LL), is often used rather than the likelihood. Moreover, we use the *unit log-likelihood*, an averaged LL, to express the LL per single observation, that is, $\frac{1}{T} \log P(O \mid \lambda)$, where $T$ is the number of observations. Within this paper, the term *log-likelihood* is used to represent the *unit log-likelihood* for simplicity.
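The unit log-likelihood can be computed with a scaled forward pass. The sketch below is one standard way to do this and is not taken from the paper; the function name `unit_log_likelihood` and the convention that observations are integer symbol indices are assumptions made for illustration.

```python
import math
from typing import List


def unit_log_likelihood(
    pi: List[float],
    A: List[List[float]],
    B: List[List[float]],
    obs: List[int],
) -> float:
    """Return (1/T) * log P(O | lambda) using a scaled forward recursion."""
    n = len(pi)
    # alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    total = 0.0
    for t in range(len(obs)):
        if t > 0:
            # alpha_{t}(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the sequence is impossible under the model
        total += math.log(scale)
        # Rescale to avoid numerical underflow on long sequences.
        alpha = [a / scale for a in alpha]
    return total / len(obs)
```

Rescaling at every step keeps the forward variables in a safe numerical range, and the sum of the logarithms of the scale factors equals the log-likelihood of the whole sequence.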

##### 2.2. Definitions of Model Characteristics

In this paper, we determine the learnability of HMMs through model identifiability. If two models are equivalent, the true model cannot be uniquely identified. Hence we firstly introduce the definition for model equivalence. Note that the HMM learning can be considered as a probability distribution specific problem, where every HMM has to be identified from the observations generated according to its own likelihood distribution. Therefore, the equivalence of HMMs can be defined based on their observation likelihood distributions as follows.

*Definition 1 (HMM equivalence).* Two HMMs $\lambda_1$ and $\lambda_2$ are *equivalent* if and only if both models have the same observation emission probabilities (i.e., the same likelihood distribution over time series) for every observation sequence $O$; that is,

$$P(O \mid \lambda_1) = P(O \mid \lambda_2) \quad \text{for all } O.$$

Note that the observation probabilities remain the same when the states of a model are permuted, since the states can be labeled arbitrarily. The model with permuted states is called a *trivial equivalent* model of the original model, as defined in [21]. We consider *trivial equivalent* models as the same model. In order to compare the models in later sections, we need to label states in a unique way such that corresponding states receive the same label. Therefore, we define a process to normalize HMMs as follows.

*Definition 2 (HMM normalization). *For each state , a score is calculated by . Based on the score, we sort the states in ascending order.

Additionally, we can always construct an equivalent HMM with additional state numbers [22]; hence, in this paper, we consider HMM identifiability only when it is* minimal*, as defined below.

*Definition 3 (HMM minimality).* An HMM $\lambda$ with $N$ states is *minimal* if and only if it has no more states than any other equivalent model $\lambda'$ with $N'$ states; that is, $N \le N'$. Model $\lambda'$ is called a *simpler* model of $\lambda$ if they are equivalent and $N' < N$.

*Definition 4 (HMM identifiability). *An HMM is* identifiable* if and only if it is minimal and there does not exist any* nontrivially equivalent* model with an equal number of states; that is, .

Moreover, in this study we only address the identification of* stationary* (or* homogeneous*) HMMs where the prior probabilities can be eliminated in calculations. The initial state prior probability distribution has an influence on learning only at the beginning of an observation sequence and its impact on large sequences vanishes over time and thus can be excluded for learning HMMs in practice. A stationary HMM is defined as follows.

*Definition 5 (stationary HMM).* An HMM is *stationary* if its state distribution remains the same at every time instant; that is, $P(q_t = s_i) = \pi_i$ for all $t$, where $\pi$ equals the equilibrium state distribution; that is, $\pi^{\top} A = \pi^{\top}$ [23, p. 4902].
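The equilibrium state distribution can be approximated by repeatedly applying the transition matrix to a starting distribution. The sketch below assumes an ergodic chain (irreducible and aperiodic), for which this power iteration converges; the function name and tolerances are illustrative choices.

```python
from typing import List


def stationary_distribution(
    A: List[List[float]], iters: int = 10000, tol: float = 1e-12
) -> List[float]:
    """Approximate the distribution pi that satisfies pi^T A = pi^T."""
    n = len(A)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        # One step of the chain: next_j = sum_i pi_i * a_ij
        nxt = [sum(pi[i] * A[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi
```

For the transition matrix `[[0.9, 0.1], [0.2, 0.8]]` the balance equation gives `pi_0 = 2 * pi_1`, so the result approaches `(2/3, 1/3)`.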

The element is a column vector with , and . The element represents the probability of going from state to state while emitting the observation by state , that is, .

Our proposed learning approach is based on the properties of observation sequences that make a state have a large impact on the model. To describe the degree of influence that a state can make on a model, we define a new term called* specificity * as the* distance* between model and the* best* model with one state less. By* best*, we mean that it matches the most on observations generated by the original model among all the one-state-fewer models, which also means that it has the minimum model distance to the original model . A general definition of* model distance* is as follows.

*Definition 6 (HMM model distance).* A model distance between two HMMs $\lambda_1$ and $\lambda_2$ is the difference of the unit log-likelihood of an observation sequence [1, p. 271]:

$$D(\lambda_1, \lambda_2) = E\left[\frac{1}{T}\left(\log P(O^{(2)} \mid \lambda_1) - \log P(O^{(2)} \mid \lambda_2)\right)\right], \tag{11}$$

where $E$ refers to the expectation operator, $O^{(2)}$ is an observation sequence generated by model $\lambda_2$, and $T$ is the size of the sequence. Equation (11) is basically a measure of how well model $\lambda_1$ matches observations generated by model $\lambda_2$, in comparison with how well model $\lambda_2$ matches observations generated by itself [1]. The *specificity* of a model can then be defined as follows.

*Definition 7 (HMM specificity). *The* specificity* of an HMM with states iswhere represents the set of all HMMs with states and is the length of an observation sequence generated by . We denote the optimal model with the minimum distance to model in (12) as .

We have to note that, to use Definitions 6 and 7 in practice, we will calculate the expectation with a single generated observation sequence. We assume that this sequence is long enough such that it is a typical sequence and gives a stable value which comes close to the expected value and as such is independent of the exact sequence, as is done by Rabiner [1].
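The single-sequence approximation described above can be sketched as follows: one long sequence is sampled from the reference model, and the unit log-likelihoods of both models on that sequence are subtracted. The sign convention follows Rabiner [1] (first model minus the generating model); the tuple layout `(pi, A, B)`, the function names, the sequence length, and the seed are assumptions for illustration.

```python
import math
import random


def draw(p, rng):
    """Sample an index from the discrete distribution p."""
    r, acc = rng.random(), 0.0
    for k, pk in enumerate(p):
        acc += pk
        if r < acc:
            return k
    return len(p) - 1  # guard against rounding in the cumulative sum


def sample(model, T, rng):
    """Generate T observation indices from model = (pi, A, B)."""
    pi, A, B = model
    q, obs = draw(pi, rng), []
    for _ in range(T):
        obs.append(draw(B[q], rng))
        q = draw(A[q], rng)
    return obs


def unit_ll(model, obs):
    """Unit log-likelihood via a scaled forward recursion."""
    pi, A, B = model
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    total = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        total += math.log(scale)
        alpha = [a / scale for a in alpha]
    return total / len(obs)


def model_distance(m1, m2, T=5000, seed=0):
    """Estimate D(m1, m2) from one sequence generated by m2."""
    obs = sample(m2, T, random.Random(seed))
    return unit_ll(m1, obs) - unit_ll(m2, obs)
```

With this convention the distance of a model to itself is exactly zero, and a poorly matching model typically yields a negative value whose magnitude grows with the mismatch.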

To use the above definitions on a limited set of observation sequences, we have to rely on an approximate equivalence approach. In order to compare the HMMs according to the likelihood probability given a set of observation sequences , we have to define a threshold on the model distance to decide whether two HMM models are equivalent or not.

*Definition 8 (distance threshold of equivalent HMMs).* The *distance threshold* is defined as

$$\delta = 3\sigma,$$

where $\mathcal{N}(\mu, \sigma^2)$ is the asymptotic distribution of the log-likelihood with mean $\mu$ and standard deviation $\sigma$ estimated from the sequences $O^{(1)}, \ldots, O^{(K)}$, the element $O^{(k)}$ represents a sequence randomly generated by model $\lambda$, $T$ is the length of an observation sequence, and $K$ is the total number of observation sequences [24]. Duan et al. [24] prove that the distribution of the log-likelihood can be approximated by a normal distribution $\mathcal{N}(\mu, \sigma^2)$. According to the "three-sigma" rule, the interval $[\mu - 3\sigma, \mu + 3\sigma]$ contains about 99.7% of the whole distribution. Thus a sequence can be taken, with this level of certainty, as generated by the model if its log-likelihood lies within $[\mu - 3\sigma, \mu + 3\sigma]$. As defined in Definition 1, two models are equivalent if and only if both models have the same likelihood distribution on observations. Hence, for any sequence generated by a second model, if its log-likelihood under the reference model lies within this interval, we can say that the two models are approximately equal. Therefore, the model distance threshold of equivalence is approximated as $3\sigma$ of the reference model for practical use.

As defined in Definition 3, a model is* minimal* if and only if it has equal number of states to or fewer number of states than any other equivalent models. In order to check model minimality in practice, we verify if there exists no one-state simpler model which is equivalent to model , in particular, to verify if the minimum distance between and (i.e., the specificity of ; see Definition 7) is outside the threshold of equivalent models defined in Definition 8. Therefore, the practical condition to check* model minimality* is defined as follows: a model can be approximately taken as* minimal* if the absolute value of its* specificity* is outside the* distance threshold* of 3-sigma; that is, .

#### 3. Impact of States on Observation Likelihood

We start with an information flow analysis to see the impact of different types of states on model specificity.

##### 3.1. Information Flow Analysis

Our aim is to understand which parameters make an HMM have a higher specificity. However, an analytical equation for the specificity function requires us to know the optimal one-state-simpler model , which is still an open problem. This leads us to an alternative approach by analysing* state properties* of models. In the following analysis, we will study which properties make up a* high-impact* state and which do not. A* high-impact* state makes itself more specific with a significant influence on ; thus it emits relatively unique patterns of observation sequences which can be distinguished from other states. Using this analysis, we will in this paper propose a framework to identify the* high-impact* states.

To study what influences the* specificity* of an HMM, we analyse the impact of a state on the likelihood and how it contributes to as follows. Consider in (10). It can be seen as a probability used in predicting the future from the past and it represents the information flow from the past to the future. Hence we will analyse the contribution of a specific state to this probability. There are three cases whereby the probability of the state plays a role in the information flow, as shown in Figure 1:(a)The present state probability depends on the previous state probability and partly determines the observation probability .(b)The present state probability depends on the observations and determines the succeeding state probability. The observation probability depends on which is updated with the knowledge of .(c)The present state probability is determined by the past state probability and affects the future state probability.

[Figure 1: the three cases (a), (b), and (c) in which the probability of a state plays a role in the information flow.]

##### 3.2. High-Impact States

We now investigate the high-impact states on likelihood , more specifically on the* specificity *. Such states should have a* high* and* unique* impact on the likelihood where* high* means a high information flow passing from the past to the future states and* unique* ensures that no other states can fill in the same role, such that it cannot be mimicked by other states either with combined similar probabilities or emitting similar observation probabilities. For instance, a state with a probability of 0.5 can be mimicked by a combination of two states with probabilities of 0.1 and 0.9, respectively; or a state with observation emission probabilities of 0.5 is also not unique. Note that a relatively high or low probability is more difficult to be mimicked than 0.5 in the previous examples. Hence for the three cases outlined in Figure 1, the state plays an intermediate role in predicting the future based on the past; we can define the following conditions for high-impact state, respectively:(a)(1) The incoming transition probabilities (see (4)) of state at time are maximal or minimal; that is, ; (2) state has a* dominant* observation at time , meaning the observation probability (see (3)) is maximal; that is, .(b)(1) The outgoing transition probabilities (see (5)) of state at time are maximal or minimal; that is, ; (2) state has a* dominant* observation at time ; refer to condition a(2).(c)Refer to conditions a(1) and b(1).

For high specificity, the above conditions should be met for all states of a model. Note that these conditions are based on state *transition* and *observation* probabilities. Regarding *transition* probabilities, a highly specific HMM should contain *persistent* and/or *transient-cyclic* states, as defined below:

(i) A *persistent* state is a state with a higher *self-transition* probability than the probabilities of transiting to other states. When all states of an HMM are *persistent*, the HMM remains in one state for a certain period before changing to another state. Such an HMM is called a *persistent* HMM.

(ii) A *transient* state, on the other hand, has a lower *self-transition* probability. A *transient-cyclic* state has one specific incoming transition probability which is high and dominant and one outgoing transition probability which is high. When all states of an HMM are *transient-cyclic*, the HMM flips from one state to another, mostly following a certain pattern (e.g., $s_1 \to s_2 \to s_3 \to s_1$). Such an HMM is called a *transient-cyclic* HMM. Otherwise, it is called a *transient-acyclic* HMM.

(iii) When an HMM contains both *persistent* and *transient-cyclic* states, we call it a *hybrid* HMM.

Secondly, regarding* observation* probabilities, a highly specific HMM should contain* privileged* states, which is defined as follows: A* privileged* state is a state with at least one* dominant* observation probability.

HMMs containing only *privileged* states are called *privileged* HMMs. This is possible when the number of observation symbols is larger than the number of states; that is, $M > N$.
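The state categories defined above can be checked directly on the model matrices. The sketch below is only an illustration: the text does not give numeric cutoffs for "high" or "dominant", so the thresholds `high=0.5` and `dominance=0.5` and all function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def persistent_states(A: Matrix) -> List[int]:
    """States whose self-transition exceeds every other outgoing transition."""
    n = len(A)
    return [
        i for i in range(n)
        if all(A[i][i] > A[i][j] for j in range(n) if j != i)
    ]


def transient_cyclic_states(A: Matrix, high: float = 0.5) -> List[int]:
    """States with one high incoming and one high outgoing transition
    to or from another state (hypothetical threshold `high`)."""
    n = len(A)
    found = []
    for i in range(n):
        incoming = [A[j][i] for j in range(n) if j != i]
        outgoing = [A[i][j] for j in range(n) if j != i]
        if incoming and max(incoming) > high and max(outgoing) > high:
            found.append(i)
    return found


def privileged_states(B: Matrix, dominance: float = 0.5) -> List[int]:
    """States with at least one dominant observation probability
    (hypothetical threshold `dominance`)."""
    return [j for j, row in enumerate(B) if max(row) > dominance]


# Illustrative matrices: a persistent chain and a three-state cycle.
A_persistent = [[0.9, 0.1], [0.2, 0.8]]
A_cyclic = [[0.05, 0.9, 0.05], [0.05, 0.05, 0.9], [0.9, 0.05, 0.05]]
```

Under these assumed thresholds, every state of `A_persistent` is persistent and none is transient-cyclic, while the opposite holds for `A_cyclic`.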

Considering both* transition* and* observation* probabilities, we define a* highly specific* HMM as an HMM containing only* persistent* states and/or* transient-cyclic* states, which will be shown as identifiable from observation sequences. Note that it is impossible to identify all minimal HMMs, especially when the influence of some states on a model is low, in the sense that such states can be neglected and the resultant simpler model is comparable to a complex one. In order to learn a* minimal identifiable* HMM, we propose in a later section an effective and efficient model approximation method which identifies* persistent* states with segmentation and clustering methods and* transient-cyclic* states with a transient analysis based on the following theorem.

Theorem 9. *The presence of transient-cyclic states with dominant observations can be identified as follows: for values of , if and , where , represents the relative frequency (i.e., the ratio of the number of times) of event occurring in the observed sequence, which is also the predicted probability of the occurrence of event ; then for*(a)

*if , that is, , , the triple does not reveal hidden transient-cyclic states and thus it can be modelled by a 1-order Markov model,*(b)

*if , the triple reveals that hidden transient-cyclic states are present:(i)*

*If , the triple reveals states with dominant observations.*(ii)*If , the triple reveals states with dominant observations and an extra mixing state.*The proof is in Appendix A.

The definitions of a Markov model and a* mixing* state used in the theorem are given as follows:(i)A* Markov* model is a stochastic process that is characterized by a Markov chain. It models the observed states with a random variable which satisfies the Markov property; that is, the distribution of the current state depends only on that of the previous state instead of the whole historical states. The state transition probability distribution and the initial state probability distribution are denoted by the same expressions as the HMM defined previously. The model can be written as .(ii)A* mixing* state is a state which outputs the same observation probabilities as a mixture of other states. HMM models containing mixing states are problematic, since one state has the same output distribution as a convex mixture of some other states’ output distribution; therefore it is difficult to distinguish the ground truth state between a single state and a mixture of several states [14].

##### 3.3. Equivalent States

Now we try to understand when a state has zero impact on the specificity such that in the extreme case a simpler HMM exists with the same distributions. Considering the information flow , for the first arrow, the influence of a state is negligible when (1a) is close to zero; (1b) the state has an equal influence as another state if the probability equals that of another state; or (1c) the influence of the state can be mimicked by the other state if the probability is constant. Note that if it is neither constant nor the same as another state, the state probability will fluctuate which makes that its influence cannot be incorporated into that of other states. For the second arrow, the influence of the state can be incorporated into that of other states if (2a) is the same as the probabilities of another state or (2b) the probability distribution is not dominant.

In case (1a) the state plays no role and can be removed, in cases (1b) and (2a) the state can be merged with a similar state, and in cases (1c) and (2b) the influence of the state can be “taken over” by some of the remaining states. This leads to the conditions for eliminating redundant (i.e., equivalent) states as shown in Table 1. Note that the difference between “removal” and “taken over” is that, by removing a state, its information is removed together with the state, while “taking over” a state means that even though the state is deleted, its information stays and is passed to other states instead.

Based on the conditions for equivalent states defined in Table 1, we can now formalize the results of our analysis as sufficient conditions for nonminimal HMMs, as follows.

Theorem 10. *A stationary HMM is not minimal if one of the following conditions holds:*(i)*The HMM contains a state that has zero incoming state transition probabilities; that is, .*(ii)*The HMM contains two states and that have the same state transition probabilities; that is, and .*(iii)*The HMM contains two states and that have the same observation probabilities and meets one of the following conditions: (1) they have the same incoming state transition probabilities; that is, ; (2) they have the same outgoing state transition probabilities; that is, ; or (3) and .*(iv)*The HMM has two observation values and contains a state that has constant incoming state transition probabilities; that is, and for all , has nondominant observation probabilities; that is, .*

The proof is in Appendix B.
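Some of the conditions of Theorem 10 can be tested mechanically on the model matrices. The sketch below covers only the conditions that are stated unambiguously in words, namely a state with zero incoming transitions and two states with equal observation probabilities together with equal incoming or equal outgoing transitions; the remaining conditions are omitted. The function names, the tolerance, and the example matrices are illustrative assumptions.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def close(u: List[float], v: List[float], tol: float) -> bool:
    """Element-wise comparison of two probability vectors."""
    return all(abs(a - b) <= tol for a, b in zip(u, v))


def nonminimality_witnesses(A: Matrix, B: Matrix, tol: float = 1e-9) -> List[Tuple]:
    """List the states or state pairs that satisfy a sufficient condition
    for nonminimality covered by this sketch."""
    n = len(A)

    def column(i: int) -> List[float]:
        return [A[r][i] for r in range(n)]

    found: List[Tuple] = []
    # A state that is never entered plays no role in a stationary HMM.
    for i in range(n):
        if all(x <= tol for x in column(i)):
            found.append(("zero_incoming", i))
    # Two states with equal emissions and equal incoming or outgoing transitions.
    for i in range(n):
        for j in range(i + 1, n):
            if close(B[i], B[j], tol):
                if close(column(i), column(j), tol):
                    found.append(("same_obs_same_incoming", i, j))
                if close(A[i], A[j], tol):
                    found.append(("same_obs_same_outgoing", i, j))
    return found


# Illustrative model in which state 2 is unreachable and states 0 and 1 coincide.
A_red = [[0.5, 0.5, 0.0], [0.5, 0.5, 0.0], [0.5, 0.5, 0.0]]
B_red = [[0.6, 0.4], [0.6, 0.4], [0.2, 0.8]]
```

An empty result does not prove minimality, since the conditions are sufficient but not necessary and only part of them is checked here.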

##### 3.4. Low-Impact States

Unlike high-impact or equivalent (zero-impact) states, some states have larger-than-zero but very low impact, which makes them hard to learn. Such states are called low-impact states. HMMs containing these states are called* hard to learn* HMMs, as will be shown later.

Since low-impact states lie in between high-impact and equivalent states, they meet a combination of partial conditions defined for both cases. As introduced in Section 3.2 for high-impact states, a learnable HMM should contain only *persistent* and/or *transient-cyclic* states with *privileged* observations, while an unlearnable HMM contains one or two states that satisfy the conditions defined in Theorem 10. Therefore, combinations of partial conditions from both cases can be defined for *hard to learn* HMMs.

An HMM is* hard to learn* if it contains mostly* persistent* or* transient-cyclic* states with* privileged* states with dominant observations and is also under one of the following conditions:(i)There exists a mixing state whose observation distribution is a mixture of the observation distributions of two other states and ; that is, , where .(ii)There exists a state with constant incoming transitions, self-included; that is, , where .(iii)There exists a state with constant incoming transitions, self-excluded; that is, , where .(iv)There exists a state with constant outgoing transitions, self-included; that is, , where .(v)There exists a state with constant outgoing transitions, self-excluded; that is, , where .(vi)There exist two states and with the same observation probabilities , where .(vii)There exists a state with constant (nondominant) observation emissions; that is, , where .

#### 4. Approximate Identification Algorithm

An HMM is either identifiable or unidentifiable. In order to describe how hard it is to identify a model, we use the term* learnability*: for an identifiable HMM, it can be easy, moderate, or hard to learn. Thus, before presenting the approximate identification algorithm, we firstly explain our hypothesis on the correlations between model learnability and specificity as shown in Figure 2, which will be validated experimentally in Section 5. HMMs containing states with higher specificity have higher distances with less complex models and as shown later are easier to learn, and vice versa. Therefore, we classify HMMs into three identification categories based on their specificity: (1)* learnable* HMMs with relatively high specificity; (2)* hard to learn* HMMs with low specificity; and (3)* unlearnable* HMMs with almost zero specificity. Our focus is to identify learnable and highly specific models with high-impact states.

##### 4.1. Algorithm Structure

Based on the previous analysis of the hidden states, we can construct an algorithm that identifies high-impact states directly from the observation sequences. We are inspired by signal processing methods such as Empirical Mode Decomposition- (EMD-) and wavelet-based denoising methods [25], which decompose the noisy signal into a number of components, filter each component, and finally reconstruct the denoised signal using the filtered components. Here we reassemble these procedures as follows: an unknown HMM is composed of a number of hidden states. These states can be identified from observations and combined to form a reconstructed model, as shown in Figure 3. In this manner, we decompose the model identification procedure into a combination of state identifications. The approximate state identification approach first recognizes persistent and transient states separately from observation sequences, then combines them into a set of identified states, and finally reduces or merges similar states into a new set of reconstructed states. The details of the identification framework are explained below.

Models with high-impact states generate specific samples which are unique. Therefore, the states are identifiable through data analysis. The output of a *persistent* state is likely to remain for a period within which the behavior at each time step is "similar", which we call a *regime*. Thus a segmentation approach is advised, which splits a signal sequence into regimes by identifying the specific behaviors within certain periods. In order to identify *transient-cyclic* states, our approach is to capture the changes of behavior with a transient analysis based on Theorem 9.

Therefore, we propose a framework to identify most high-impact and minimal states: (1) persistent privileged states; (2) transient-cyclic privileged states; and (3) hybrid: persistent and transient-cyclic privileged states. A schema of the proposed algorithm is shown in Figure 4. We assume that both persistent and transient states exist in the model; therefore, both segmentation and clustering and transient analysis methods are applied on the data, followed by a reparameterization procedure to combine the parameters learned from both the previous methods. Finally, a model reduction step is conducted in the end to form a simplified minimal HMM model. We name our proposed method as Segmentation-Clustering and Transient analysis (SCT) framework.

##### 4.2. Segmentation-Based Approach

The segmentation-based approach is defined using the following steps: Step 1: signals are split by segmentation techniques into different regimes with different signal behaviors; Step 2: the “similar” regimes of signals are grouped together by clustering techniques according to their similarities (the clusters are labeled and each cluster is a hidden state); Step 3: a clustering validation index is employed to determine the proper number of states; finally, Step 4: HMM parameters are estimated by calculating statistical occurrences of the observations and the estimated hidden states.

*Step 1 (identification of persistent states by segmentation). *Data sequences emitted by* persistent* states can be segmented into subsequences with constant behavior (observations are drawn from a stationary distribution). The transition from one state to another can be identified by detecting a difference in signal behavior. This is called a* change point*. In this paper, we propose a sliding window-based Bayesian segmentation based on the test of [26]. The Bayesian probability is calculated to determine whether two sequences have been generated by the same or by a different multinomial model.

A *multinomial* model is a stochastic process in which the observations follow a multinomial distribution. It is sufficient to model observations instead of states, since the observations in a multinomial model can represent states correspondingly with absolute state knowledge. Each observed symbol at time is independent and falls into one of categories with a fixed probability, denoted by , where . The observation probability distribution is denoted by , where . The model can be represented with a compact notation .

The first sequence always starts from the last change point (the first point if at the beginning) and ends at the current time point; the second sequence is a fixed-length sliding window starting from the next time point. If the two successive sequences are very likely from different models, the point in between is marked as a change point. The procedure repeats until the end of the signal.
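The sliding-window procedure above can be sketched as follows. This is a minimal illustration and not the exact test of [26]: it assumes that the two hypotheses are compared through Dirichlet-multinomial marginal likelihoods with a symmetric prior (`alpha = 1`), and the names `log_marginal`, `log_bayes_factor`, `detect_change_points`, `window`, and `log_bf_threshold` are hypothetical. The small refinement step, which picks the strongest evidence within one window length, is also an assumption added so that a change point is not reported too early.

```python
import math
from collections import Counter

def log_marginal(counts, n_symbols, alpha=1.0):
    """Log marginal likelihood of symbol counts under a symmetric Dirichlet prior."""
    n = sum(counts.values())
    out = math.lgamma(n_symbols * alpha) - math.lgamma(n_symbols * alpha + n)
    for k in range(n_symbols):
        out += math.lgamma(alpha + counts.get(k, 0)) - math.lgamma(alpha)
    return out

def log_bayes_factor(seq1, seq2, n_symbols):
    """Log evidence for 'two different multinomial models' versus 'one shared model'."""
    c1, c2 = Counter(seq1), Counter(seq2)
    return (log_marginal(c1, n_symbols) + log_marginal(c2, n_symbols)
            - log_marginal(c1 + c2, n_symbols))

def detect_change_points(signal, n_symbols, window=20, log_bf_threshold=3.0):
    """Compare [last change point, t) with the sliding window [t, t + window)."""
    change_points, start, t = [], 0, 1
    last = len(signal) - window  # last index at which a full window still fits
    while t <= last:
        bf = log_bayes_factor(signal[start:t], signal[t:t + window], n_symbols)
        if bf > log_bf_threshold:
            # Assumed refinement: keep the split with the strongest evidence nearby.
            candidates = range(t, min(t + window, last) + 1)
            t = max(candidates, key=lambda u: log_bayes_factor(
                signal[start:u], signal[u:u + window], n_symbols))
            change_points.append(t)
            start = t
        t += 1
    return change_points
```

For a signal such as `[0] * 100 + [1] * 100`, the sketch reports a single change point near index 100, while a constant signal yields none.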

*Step 2 (combination of states by clustering).* With HMMs, segments corresponding to the same state will recur over time. Assuming that there is a finite number of states, segments belonging to the same state are detected and clustered together. In this study, the classical *k*-means clustering approach [27, 28] is chosen to group and label segments. In our case, the *k*-means clustering algorithm tries to group the segments into unique states based on the mean value of data features within each cluster, given by . Because *k*-means clustering is sensitive to the random selection of initial parameters, we perform a preliminary step for selecting centroid starting locations. The selected properties in the segmentation step form a one-dimensional sequence, which contains subsequences of equal length. The median values of the subsequences are then used as initial centroid locations.
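A minimal sketch of this clustering step is given below. It assumes that each segment is summarized by one scalar feature and that the equal-length subsequences are obtained by sorting the feature values and splitting them into *k* chunks; this reading of the initialization, as well as the names `median_init` and `kmeans_1d`, are assumptions made for illustration.

```python
import statistics

def median_init(features, k):
    """Initial centroids: medians of k equal-length chunks of the sorted features."""
    ordered = sorted(features)
    n = len(ordered)
    chunks = [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]
    return [statistics.median(chunk) for chunk in chunks]

def kmeans_1d(features, k, n_iter=100):
    """Plain k-means on scalar features with deterministic median initialization."""
    centroids = median_init(features, k)
    labels = []
    for _ in range(n_iter):
        # Assignment step: each segment goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: abs(x - centroids[j])) for x in features]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for j in range(k):
            members = [x for x, lab in zip(features, labels) if lab == j]
            updated.append(sum(members) / len(members) if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids
```

Because the initialization is deterministic, repeated runs on the same features return the same labels, which avoids the restarts that random initialization would require. The sketch requires `k` to be no larger than the number of segments.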

*Step 3 (cluster validity).* To select the optimal number of clusters, we propose a constraint-based clustering analysis that considers both the cluster separation capabilities of hidden states and the simplicity of HMM models. *Constraint 1*: a lower Davies-Bouldin index (DBI) [29] suggests that the clustering exhibits a better intracluster grouping and intercluster separation of each state. *Constraint 2*: instead of selecting the minimum DBI, an allowance with a threshold of 0.05 is given so that a smaller number of states will be selected if its DBI is within the range of .

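Suppose a dataset $X$ is partitioned into $K$ disjoint nonempty clusters and let $\{C_1, \dots, C_K\}$ denote the obtained partition, such that $C_i \cap C_j = \emptyset$ (empty set) for $i \neq j$, $C_i \neq \emptyset$, and $X = \bigcup_{i=1}^{K} C_i$. The Davies-Bouldin index [29] is defined as

$$\mathrm{DB} = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \frac{\Delta(C_i) + \Delta(C_j)}{\delta(C_i, C_j)},$$

where $\Delta(C_i)$ and $\delta(C_i, C_j)$ indicate the intracluster diameter and the intercluster distance, respectively. The partition with the minimum Davies-Bouldin index is considered as the optimal choice.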
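The two constraints of Step 3 can be illustrated with the short sketch below. It assumes scalar features, takes the intracluster diameter as the mean absolute distance to the cluster centroid and the intercluster distance as the distance between centroids (one common choice), and interprets the allowance as "within 0.05 of the minimum DBI". The function names are hypothetical.

```python
def davies_bouldin_1d(features, labels):
    """Davies-Bouldin index for scalar features; needs at least two distinct clusters."""
    clusters = sorted(set(labels))
    centroid, scatter = {}, {}
    for c in clusters:
        members = [x for x, lab in zip(features, labels) if lab == c]
        centroid[c] = sum(members) / len(members)
        scatter[c] = sum(abs(x - centroid[c]) for x in members) / len(members)
    total = 0.0
    for i in clusters:
        total += max((scatter[i] + scatter[j]) / abs(centroid[i] - centroid[j])
                     for j in clusters if j != i)
    return total / len(clusters)

def select_n_states(dbi_by_k, allowance=0.05):
    """Smallest number of states whose DBI lies within `allowance` of the minimum."""
    best = min(dbi_by_k.values())
    return min(k for k, dbi in dbi_by_k.items() if dbi <= best + allowance)
```

For example, with DBI values `{2: 0.30, 3: 0.27, 4: 0.35}` the minimum is reached at three states, but two states are selected because 0.30 lies within the allowance, which favors the simpler model.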

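*Step 4 (parameter estimation).* The parameters of an HMM (i.e., the prior, transition, and observation probabilities) can be calculated by simply counting the occurrences of the observed symbols and the hidden states (i.e., the labels retrieved from clustering), which is the same calculation as the reestimation step of the Baum-Welch algorithm [1]:

$$\hat{\pi}_i = \frac{\#\{\text{series with } q_1 = i\}}{\#\{\text{series}\}}, \qquad \hat{a}_{ij} = \frac{\#\{t < T : q_t = i,\ q_{t+1} = j\}}{\#\{t < T : q_t = i\}}, \qquad \hat{b}_i(k) = \frac{\#\{t : q_t = i,\ o_t = k\}}{\#\{t : q_t = i\}},$$

where $q_t$ and $o_t$ denote the state label and the observed symbol at time $t$, and $T$ is the series length.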
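The counting in Step 4 can be sketched as follows, assuming that the data are given as several series of integer state labels and integer observation symbols. The function name `estimate_by_counting` and the uniform fallback for states that never occur are assumptions of this sketch.

```python
def normalize(row):
    """Turn counts into probabilities; fall back to uniform if nothing was counted."""
    total = sum(row)
    if total == 0:
        return [1.0 / len(row)] * len(row)
    return [v / total for v in row]

def estimate_by_counting(state_series, obs_series, n_states, n_symbols):
    """Estimate (prior, transition, observation) matrices from labeled series."""
    pi = [0.0] * n_states
    A = [[0.0] * n_states for _ in range(n_states)]
    B = [[0.0] * n_symbols for _ in range(n_states)]
    for states, obs in zip(state_series, obs_series):
        pi[states[0]] += 1                       # first state of each series
        for s, o in zip(states, obs):
            B[s][o] += 1                         # state/symbol co-occurrences
        for s, s_next in zip(states, states[1:]):
            A[s][s_next] += 1                    # consecutive state pairs
    return normalize(pi), [normalize(r) for r in A], [normalize(r) for r in B]
```

Since the state labels are known after clustering, no forward-backward pass is needed, which is where most of the cost of an iterative reestimation would otherwise arise.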

##### 4.3. Transition-Based Approach

In this section, we present a transition-based approach to identify transient-cyclic states. To estimate the observation matrix, we apply Theorem 9, which is dedicated to identifying transient-cyclic privileged states with or without mixing states. Firstly, the first-order transition probabilities can be identified under a Markov model assumption by counting the occurrences in the observation sequences: Similarly, a second-order transition probability can be modelled under an HMM assumption and calculated by counting: where . A threshold for dominant probabilities is calculated as If the two consecutive first-order probabilities are dominant, that is, and , where , then the division of the second-order transition probabilities calculated from a Markov chain and from an HMM assumption is If , there are no transient-cyclic states. Otherwise, the first-order transition probabilities are taken as the dominant observation probabilities and used to build the observation matrix. If , a mixing state is present and one extra state is added to the observation matrix with a uniformly distributed probability of . In the end, we map each observation value , , and to a different state because we look for states with at least one dominant observation value. If a state has multiple observation values, they will be merged into one state. See Table 1 for the conditions of model state reduction.

Take a simple 2-observation case as an example; we generate 10 series, each of length 1000. The first-order observation transition probabilities are . If the calculated occurrence probabilities are , the dominant transitions larger than are and . Thus, the dominant second-order probabilities are , equal to calculated by a Markov model, while being equal to calculated by an HMM. Thus, the division of the two is . Since both probabilities are smaller than 1, they are dominant states and there is no mixing state. Therefore, we map the two dominant probabilities to the observation probabilities of two states and the final observation matrix is .

Furthermore, for calculating the prior and transition probabilities, we stick to the assumption that privileged state behaviors can be reflected by observation properties; therefore, we assume the number of states is the same as the number of observations and use a Markov model for learning state probabilities.

The prior and transition matrices can be calculated by counting the observation occurrences: Therefore, a model is learned containing only transient-cyclic states.
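The counting used throughout this section can be sketched as follows. The sketch only computes relative frequencies: a prior taken as the overall symbol frequency (an assumption), the first-order transition frequencies between consecutive observations, and the second-order frequencies conditioned on the two preceding observations. It does not reproduce the dominance threshold or the division of Theorem 9, and the function names are hypothetical.

```python
from collections import defaultdict

def first_order(series_list, n_symbols):
    """Symbol frequencies and first-order transition frequencies by counting."""
    prior = [0.0] * n_symbols
    counts = [[0.0] * n_symbols for _ in range(n_symbols)]
    for obs in series_list:
        for o in obs:
            prior[o] += 1
        for a, b in zip(obs, obs[1:]):
            counts[a][b] += 1
    total = sum(prior)
    prior = [v / total for v in prior]
    trans = []
    for row in counts:
        s = sum(row)
        # Symbols that never occur as a predecessor get a uniform row.
        trans.append([v / s for v in row] if s > 0 else [1.0 / n_symbols] * n_symbols)
    return prior, trans

def second_order(series_list, n_symbols):
    """Frequencies of the next symbol given the two preceding symbols."""
    counts = defaultdict(lambda: [0.0] * n_symbols)
    for obs in series_list:
        for a, b, c in zip(obs, obs[1:], obs[2:]):
            counts[(a, b)][c] += 1
    return {ctx: [v / sum(row) for v in row] for ctx, row in counts.items()}
```

For a strictly alternating series such as `[0, 1, 0, 1, 0, 1]`, the first-order matrix puts all mass on the off-diagonal entries, which is the signature of a two-state transient-cyclic behavior.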

##### 4.4. Reparameterization

Parameters learned for both *persistent* and *transient-cyclic* states are combined by a procedure called *reparameterization*. Let be the parameters for *persistent* states and be the parameters for the *transient-cyclic* states. Let and be the number of *persistent* and *transient-cyclic* states, respectively. Thus the combined number of states is . The parameters of the combined model can be calculated as where the function *Normalize* ensures that the sum of the given vectors equals 1 and the *Stochastic* function ensures that the sum of each row of the given matrix equals 1.
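The roles of *Normalize* and *Stochastic* can be illustrated with the sketch below. The way the two parameter sets are joined here is only an assumption made for illustration: priors and observation matrices are concatenated, and the two transition matrices are placed on a block diagonal with a small hypothetical coupling value `eps` between the blocks, so that the combined chain can move between the two groups of states.

```python
def normalize(vec):
    """Rescale a nonnegative vector so that its entries sum to 1."""
    total = sum(vec)
    return [v / total for v in vec]

def stochastic(matrix):
    """Rescale every row of a nonnegative matrix so that it sums to 1."""
    return [normalize(row) for row in matrix]

def combine(pi_p, A_p, B_p, pi_t, A_t, B_t, eps=0.01):
    """Join persistent (p) and transient-cyclic (t) parameters (assumed layout)."""
    n_p, n_t = len(pi_p), len(pi_t)
    pi = normalize(pi_p + pi_t)
    # Block-diagonal transitions with a small coupling between the two blocks.
    A = [row + [eps] * n_t for row in A_p] + [[eps] * n_p + row for row in A_t]
    B = [row[:] for row in B_p] + [row[:] for row in B_t]
    return pi, stochastic(A), stochastic(B)
```

Whatever combination rule is used, the two helper functions guarantee that the result is again a valid HMM parameter set with `n_p + n_t` states.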

##### 4.5. Model Reduction

After combining the persistent and transient-cyclic states, redundant states may occur. We introduce a model reduction procedure which removes redundant states to obtain minimal HMMs according to the conditions defined in Theorem 10. We relax the strict conditions given in the theorem by adding thresholds.

(i) *An HMM contains a state that has zero incoming state transition probabilities; that is, .* Instead of using a zero vector as a strict rule, a threshold is defined to allow near-zero cases such that if the sum of the incoming transition state probabilities of state is lower than threshold , that is, the state can be removed.

(ii) *An HMM contains two states and that have the same state transition probabilities; that is, and .* We replace the equivalence condition by a subtraction calculation. If the maximum of the incoming and outgoing state transition probabilities of the two states and is below a threshold , that is, the two states can be merged into one.

(iii) *An HMM contains two states and that have the same observation probabilities , and meet one of the following conditions: (1) they have the same incoming state transition probabilities; that is, ; (2) they have the same outgoing state transition probabilities; that is, ; or (3) and .* Similar to (ii), we use subtraction instead of strict equivalence with added threshold . Moreover, the *AND* condition and the *OR* conditions can be represented by selecting the maximum and minimum values, respectively. Therefore, if the two states and can be merged.

(iv) *An HMM has two observation values and contains a state that has constant incoming state transition probabilities, and , and has nondominant observation probabilities, .* where is the average of and the state can be “taken over.”

The selection of thresholds is conducted empirically since the correlation between the likelihood value and each condition is complex and is not our focus in this paper.
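As an illustration of the relaxed condition (i), the sketch below removes every state whose summed incoming transition probability falls below a threshold and renormalizes the remaining parameters. The threshold value `0.01`, the function name, and the choice to sum over the whole column of the transition matrix are assumptions; the sketch also assumes that the removed states do not carry all of the prior mass.

```python
def remove_low_incoming_states(pi, A, B, threshold=0.01):
    """Drop states whose total incoming transition probability is below `threshold`."""
    n = len(A)
    keep = [j for j in range(n) if sum(A[i][j] for i in range(n)) >= threshold]

    def renorm(row):
        total = sum(row)
        return [v / total for v in row]

    new_pi = renorm([pi[j] for j in keep])
    new_A = [renorm([A[i][j] for j in keep]) for i in keep]
    new_B = [B[i][:] for i in keep]
    return new_pi, new_A, new_B
```

The merge conditions (ii) and (iii) can be checked in the same spirit by taking the maximum (for *AND*) or minimum (for *OR*) of the absolute differences between the corresponding rows and columns and comparing it with a threshold.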

#### 5. Experiments

Simulated data has been used to evaluate the effectiveness and efficiency of the proposed SCT inference framework. The simulated data were sampled from different classes of HMM models: nonminimal equivalent HMMs, identifiable (selected and random) minimal HMMs, and hard to learn HMMs.

##### 5.1. Nonminimal Equivalent HMMs

Equivalent HMMs comprise two cases: (1) two HMMs with the same number of states, where permutations of states apply to both models; (2) two HMMs with different numbers of states. This experiment focuses on case (2) and aims to test the model reduction conditions defined in Table 1, where a nonminimal HMM can remove, merge, or take over its redundant states to become an equivalent minimal HMM. One model is selected under each of the three reduction conditions and is used to construct an equivalent model by removing the redundant state (set as the last state here). The model parameters are listed hereafter, respectively: With a *removable* state: HMM : HMM : With a *mergeable* state: HMM : HMM : With a *taken-over* state: HMM : HMM :

Each of the reference nonminimal models is used to generate 1000 datasets of random observations containing sequences of observation points. The datasets are used to determine log-likelihood distributions of the models and the distance threshold of equivalent models defined in Definition 8. By calculating the percentage of the log-likelihood values of the minimal model which fall inside the threshold of equivalence for the nonminimal model , we can obtain a confidence level. The results in Table 2 show that the two models are approximately equivalent for all three cases with high confidence levels. Moreover, the log-likelihood histograms are plotted in Figure 5. The highly overlapping histograms further demonstrate the model equivalence for the three examples.

##### 5.2. Highly Specific HMMs

As discussed previously, persistent and transient-cyclic HMMs with privileged observations are identifiable HMMs that have a high specificity. In this section, we compare the learning of such identifiable HMMs with the Baum-Welch (BW) algorithm and the proposed SCT method.

Firstly, we constructed 9 persistent and 9 transient-cyclic models as ground truth models with a fixed equal number of states and observations () ranging from 2 to 10. These models can be expressed as follows: Persistent : Transient-cyclic :

The state self-transition probabilities for persistent models and the transition probabilities to the next neighboring state for transient-cyclic models were both set to a value close to 1, noted as and , respectively, where . The remaining transitions have equal probabilities; that is, . For all the 18 models, the observation matrices are set the same as the transition matrices of persistent models in order to obtain privileged observations. The initial parameters are uniformly distributed.
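The construction of the two ground truth transition matrices can be sketched as follows. The value `p = 0.9` in the usage note is only a placeholder, and the remaining probability is spread equally over the other `n - 1` entries of each row, which follows from the requirement that each row sums to 1. The function names are hypothetical.

```python
def persistent_matrix(n, p):
    """Self-transition probability p on the diagonal, the rest spread equally."""
    q = (1.0 - p) / (n - 1)
    return [[p if i == j else q for j in range(n)] for i in range(n)]

def transient_cyclic_matrix(n, p):
    """Probability p of moving to the next neighboring state (cyclically)."""
    q = (1.0 - p) / (n - 1)
    return [[p if j == (i + 1) % n else q for j in range(n)] for i in range(n)]
```

For instance, `persistent_matrix(3, 0.9)` places 0.9 on the diagonal, while `transient_cyclic_matrix(3, 0.9)` places 0.9 on the entries (0, 1), (1, 2), and (2, 0). Following the text, the observation matrix of either model can be taken as `persistent_matrix(n, p)`.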

We also generated 20 hybrid models as ground truth models containing both persistent and transient-cyclic states. The number of states and for both cases was randomly chosen from 2 to 3, amounting to a total number of states within a range of . The rest of the parameters were generated in the same manner as before. For simplicity, we use to represent a matrix containing only element . Thus, a hybrid model can be represented as follows: Hybrid :

The experiments were carried out by using each of the constructed models as a reference model to generate a dataset of 10 series of 1000 observations. The first seven series were used as a training set and the last three series as a test set. The true number of states was assumed to be unknown and the learning methods had to select the number of states from a state pool of . For the BW learning algorithm, models were generated with a number of states ranging from 2 to and the model with the best was selected by the AIC criterion [30]. The learning of the BW algorithm was repeated 20 times to reduce the risk of local optima and the model with the minimum AIC value was selected. In total, models were generated to determine an optimal model. Moreover, for comparison purposes, we also trained BW with a given number of states ; in this case, a total of 20 models were generated and the best model was selected by the AIC criterion. On the other hand, for the proposed SCT method, the number of states was selected by the clustering validation method in Step 3 (see Section 4.2). Only one SCT model was trained and compared with the two best models selected by the BW method (with and without a given ).
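The AIC-based selection can be sketched as follows, using the standard definition AIC = 2k − 2 ln L. The parameter count assumes a discrete HMM in which every row of the prior, transition, and observation matrices loses one degree of freedom through normalization; whether the experiments count parameters in exactly this way is an assumption, and the function names are hypothetical.

```python
def n_free_parameters(n_states, n_symbols):
    """Free parameters of a discrete HMM: prior + transition + observation."""
    return (n_states - 1) + n_states * (n_states - 1) + n_states * (n_symbols - 1)

def aic(log_likelihood, n_states, n_symbols):
    """Akaike information criterion; lower is better."""
    return 2 * n_free_parameters(n_states, n_symbols) - 2 * log_likelihood

def select_by_aic(candidates, n_symbols):
    """candidates: list of (n_states, training log-likelihood) pairs."""
    return min(candidates, key=lambda c: aic(c[1], c[0], n_symbols))
```

With two observation symbols, a 3-state model with log-likelihood −99 has a higher AIC than a 2-state model with log-likelihood −100, so the smaller model is preferred unless the extra state improves the fit substantially.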

In order to use the 3-sigma rule to indicate if a true model is learned, 100 datasets of 10 series of 1000 observations were generated from each of the true models and used for calculating the log-likelihood distribution (see Definition 8). If the log-likelihood difference between the ground truth model and the learned model is outside the distance threshold of , we consider that the true model has not been found (i.e., a local optimum is learned). Moreover, for better understanding the log-likelihood results, we additionally calculated the log-likelihoods for the following models: (1) : the best model with states selected from 100 randomly generated models and trained with BW; (2) : the best model with states selected from 100 randomly generated models and trained with BW; (3) a multinomial model: the model assuming that there are no hidden states and the observations are the actual visible states. If an HMM model has similar log-likelihood as a multinomial model, the states have no impact on the model.
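A minimal sketch of the 3-sigma check is given below. Definition 8 is not reproduced here, so the sketch assumes that the distance threshold equals three sample standard deviations of the log-likelihoods that the true model obtains on its own datasets, centered at their mean. The function names are hypothetical.

```python
import statistics

def within_three_sigma(true_logliks, loglik, n_sigma=3.0):
    """True if `loglik` lies within n_sigma standard deviations of the reference mean."""
    mu = statistics.mean(true_logliks)
    sigma = statistics.stdev(true_logliks)
    return abs(loglik - mu) <= n_sigma * sigma

def confidence_level(true_logliks, other_logliks, n_sigma=3.0):
    """Fraction of another model's log-likelihoods that fall inside the threshold."""
    inside = [within_three_sigma(true_logliks, ll, n_sigma) for ll in other_logliks]
    return sum(inside) / len(inside)
```

The first function corresponds to the decision of whether a learned model is counted as identified, and the second to the confidence level used for approximately equivalent models in Section 5.1.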

In addition to log-likelihoods, we define other performance indicators of accuracy as follows: (1) the *percentage of convergence* is the percentage of the 20 BW learned models which did not fall into a local optimum; (2) the *percentage of identification* is the percentage of the best BW or the SCT learned models which did not fall into a local optimum; (3) the *parameter distance* is defined as the mean difference of the triples () between two HMM models. If the two models have different state spaces, the probability matrices of the simpler model are padded with zeros so that it has the same number of states as the more complex one. Moreover, all the permutations of the models are considered and the minimum distance is chosen as the *parameter distance*.
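The parameter distance can be sketched as follows, assuming that "mean difference" means the mean absolute difference over all entries of the prior, transition, and observation matrices. Each model is passed as a tuple `(pi, A, B)` of plain lists; the names `pad` and `parameter_distance` are hypothetical.

```python
from itertools import permutations

def pad(pi, A, B, n):
    """Zero-pad a model with fewer than n states up to n states."""
    k, m = len(pi), len(B[0])
    pi2 = list(pi) + [0.0] * (n - k)
    A2 = [list(row) + [0.0] * (n - k) for row in A] + [[0.0] * n for _ in range(n - k)]
    B2 = [list(row) for row in B] + [[0.0] * m for _ in range(n - k)]
    return pi2, A2, B2

def parameter_distance(model1, model2):
    """Minimum over state permutations of the mean absolute parameter difference."""
    n = max(len(model1[0]), len(model2[0]))
    pi1, A1, B1 = pad(*model1, n)
    pi2, A2, B2 = pad(*model2, n)
    m = len(B1[0])
    best = float("inf")
    for perm in permutations(range(n)):
        diffs = [abs(pi1[i] - pi2[perm[i]]) for i in range(n)]
        diffs += [abs(A1[i][j] - A2[perm[i]][perm[j]])
                  for i in range(n) for j in range(n)]
        diffs += [abs(B1[i][o] - B2[perm[i]][o])
                  for i in range(n) for o in range(m)]
        best = min(best, sum(diffs) / len(diffs))
    return best
```

The exhaustive search over permutations grows factorially with the number of states, so this direct form is only practical for small models such as those considered here.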

Average information on the ground truth models can be found in Table 3. The hybrid models are less specific than the models containing only persistent or only transient-cyclic states, which makes them harder to identify. Detailed learning results can be found in Table 4. The results show that the proposed SCT method learns much faster than the traditional BW algorithm, with a speedup of around 180 to 260 times. The BW algorithm tends to overfit the model by using a larger number of states, resulting in a higher parameter distance. Even when the true number of states is given, for persistent cases, BW still cannot learn correctly: it shows a larger test-set log-likelihood difference and lower convergence and identification rates.

Learning results for a 10-state persistent model, a 10-state transient-cyclic model, and a 6-state hybrid model are used as examples for visualization. For the persistent model, the iterative learning process is shown in Figures 6(a) and 6(b). In Figure 6(a), the BW training was conducted with an unknown number of states , while in Figure 6(b), was given. Similarly, results for the transient-cyclic model are shown in Figures 7(a) and 7(b), and those for the hybrid model in Figures 8(a) and 8(b). The figures show that the proposed SCT method starts from a good initial model and converges much faster than most of the 20 randomly initialized BW models, which start from a level almost equivalent to that of the multinomial model, for which hidden states play no role. Moreover, although some of the best BW models converge in the end, their log-likelihood values are still not as good as those of the SCT method. We see that some of the 20 repeated models with states become stuck in a local optimum with a log-likelihood similar to that of model or .

**Figure 6** (persistent model): **(a)** log-likelihoods during iterative training, BW with unknown number of states; **(b)** log-likelihoods during iterative training, BW given the correct number of states; **(c)** heat map of HMM model parameters.

**Figure 7** (transient-cyclic model): **(a)** log-likelihoods during iterative training, BW with unknown number of states; **(b)** log-likelihoods during iterative training, BW given the correct number of states; **(c)** heat map of HMM model parameters.

**Figure 8** (hybrid model): **(a)** log-likelihoods during iterative training, BW with unknown number of states; **(b)** log-likelihoods during iterative training, BW given the correct number of states; **(c)** heat map of HMM model parameters.

In order to compare the model parameters, heat maps of the original and inferred state transition and observation matrices are plotted in Figures 6(c), 7(c), and 8(c). A lighter color indicates a higher probability value close to one, while a darker color indicates a lower probability value close to zero. We notice that the BW method with an unknown learns a complex model with two more states than the true model for all the three cases, which is overfitting, while the SCT approach learns the state size correctly. Moreover, the SCT method has a one-to-one correspondence of the high probabilities (in white/light-yellow) between the transition and observation parameter matrices, meaning the SCT trained model is almost equivalent to the reference true model. However, for the BW method, especially when is unknown, there are no one-to-one relations in both transition and observation matrices, noticeable by some of the varied colors of heat maps from the true model. It means that some of the probabilities are wrongly learned.

##### 5.3. Hard to Learn HMMs

For each of the seven hard to learn conditions defined in Section 3.4, we construct five ground truth models, resulting in a total of 35 models. For each model, persistent and transient-cyclic state numbers are randomly generated from a range of . The privileged state probability is set randomly within a range of . The remaining probabilities are uniformly distributed. For conditions (ii), (iii), (iv), and (v), one extra state is generated according to the specified conditions. For condition (i), state 3 is defined as a mixing state of states 1 and 2. For condition (vi), the first two states are set to have the same observations. For condition (vii), state 2 is set to have constant observation emission probabilities. The rest of the experiment is set up in the same way as the previous experiments designed for identifiable models. Results show that only 31% of the 35 ground truth models are specific, with an average specificity of 0.04. A detailed comparison of the learning results is presented in Table 5.

From the results in Table 5 we can see that the BW algorithm is slower than the proposed SCT method, needing almost twice as many iterations to converge. The SCT method is around 230 times faster than the BW algorithm. A positive average delta indicates that the BW method mostly overfits the true models with an average of 1.71 extra states, while the SCT has a negative delta indicating a slight underfitting with an average of 0.66 fewer states. Even though the test-set log-likelihood difference of the SCT is higher than that of the BW method, the average parameter distance further indicates that the BW algorithm tends to overfit the models in order to obtain a lower log-likelihood difference. Moreover, the percentage of convergence reveals that repeated training (e.g., 20 times in this experiment) is still necessary for the BW method to learn effectively, even with the trade-off of a longer learning time. Lastly, the SCT has a slightly lower but comparable identification percentage, with a significant learning speedup in return.

To visualize the results, we select two models as examples, one under condition (i), a state being a mixing state, and one under condition (v), a state with constant self-excluded outgoing transitions; they are shown in Figures 9 and 10, respectively.

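**Figure 9** (condition (i), mixing state): **(a)** log-likelihoods during iterative training, BW with unknown number of states; **(b)** log-likelihoods during iterative training, BW given the correct number of states; **(c)** heat map of HMM model parameters.

**Figure 10** (condition (v), constant self-excluded outgoing transitions): **(a)** log-likelihoods during iterative training, BW with unknown number of states; **(b)** log-likelihoods during iterative training, BW given the correct number of states; **(c)** heat map of HMM model parameters.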

Figures 9 and 10 show that the BW algorithm with an unknown overfits the true models with two extra states, while the SCT method underfits the models with one state fewer: both the mixing state and the state with the same outgoing transitions in the two examples are merged into other states because they are not specific enough to be identified.

##### 5.4. Random HMMs

In this experiment, we generated 10000 random HMM models configured with a combination of random and random . In order to guarantee that each HMM is minimal, we select models according to two criteria: (1) the model should have a higher test-set log-likelihood than the one of a multinomial model; (2) the model compared to the best state model should not satisfy the three-sigma rule for model equivalence criteria defined in Definition 8. A random HMM is discarded if it is not minimal. In the end, we obtain 149 specific minimal HMMs. The training procedure is conducted in the same way as in Section 5.2.

The experiment shows that the average specificity of the true models is 0.03, which is around 10 times less specific than the identifiable models used in the previous experiments in Section 5.2. Moreover, the mean of the log-likelihood distribution of the true model is 1.58, which is also much higher than that of the identifiable models. These results indicate that random models are less specific and therefore less identifiable. A detailed comparison of the identification results is shown in Table 6.

The results show that the SCT method needs on average one more iteration than the BW algorithm and that its identification results are less adequate because the models are not specific enough to be estimated correctly. However, the SCT method is still around 50 times faster than the Baum-Welch method. Both approaches overfit the models by more than one state on average.

Figure 11(a) shows the dependence between the true model *specificity* and the test-set *log-likelihood difference* with respect to the true models. When the specificity is too low, the SCT method identifies the models less accurately. Thus, the less specific the model is, the harder it becomes for the SCT method to learn. For the purpose of comparison, Figure 11(b) shows the same plot for the highly specific models generated in Section 5.2. The results further confirm that, for highly specific models, when the specificity is relatively low, the SCT method outperforms the BW method. The log-likelihood differences of the models learned by Baum-Welch increase significantly, indicating that completely wrong models are learned.

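**Figure 11**: true model specificity versus test-set log-likelihood difference. **(a)** Random models; **(b)** highly specific models generated in Section 5.2.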

The results are expected because the SCT method is designed for highly specific models but not for random ones with less specificity. In order to see the influence of specificity for the SCT method to learn correctly, we plot the identification accuracy versus the specificity thresholds ranging from −0.01 to 0.2 with a step of 0.01 as shown in Figure 12. The models are selected when they have a specificity higher than a specificity threshold; then the percentage of correctly identified models within the selected models is used as the identification accuracy.
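The accuracy curve described above can be computed with the following sketch. The input format, a list of `(specificity, identified)` pairs with one entry per ground truth model, and the function name are assumptions made for illustration.

```python
def accuracy_by_threshold(results, thresholds):
    """For each threshold, the share of correctly identified models among those
    whose specificity is strictly higher than the threshold."""
    curve = []
    for th in thresholds:
        selected = [identified for spec, identified in results if spec > th]
        accuracy = sum(selected) / len(selected) if selected else None
        curve.append((th, accuracy))
    return curve

# Thresholds from -0.01 to 0.2 with a step of 0.01, as in the text.
thresholds = [round(-0.01 + 0.01 * i, 2) for i in range(22)]
```

A threshold above the highest specificity selects no model, for which the sketch returns `None` instead of an accuracy.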

The identification accuracy of the SCT method starts at a low value of 87.9% and generally increases with an increasing specificity threshold. When the specificity threshold is at 0.06, the identification percentage of the SCT drops to 93.8%. This drop is caused by a single case, observable in Figure 11(a), with the highest log-likelihood difference. Such a case does not represent the dependency trend between specificity threshold and identification accuracy and can thus be ignored. When the threshold is higher than 0.06, the proposed SCT method converges to an identification accuracy of 100%.

#### 6. Conclusions

This paper studied the possibility of identifying HMMs from properties of the observation sequences directly. We conducted an analysis of the information flow throughout an HMM. Based on this analysis we were able to show that there are two types of states, namely, persistent and transient, that have a high impact on the observation likelihood. An HMM consisting of high-impact states is highly specific, in the sense that it differs substantially in observation likelihood from the best HMM with one state less.

A learning algorithm, called SCT, was constructed based on this analysis which correctly identifies highly specific models. But even for low-specific models, the identification accuracy is still around 88%. The algorithm is about two orders of magnitude faster than the traditional Baum-Welch algorithm.

#### Appendix

#### A. Proof of Theorem 9

We prove that the presence of transient-cyclic states with dominant observations can be identified through the division defined in (14) under the following conditions. Note that we consider the relative frequency to be close to the true probability, such that the following derivations apply.

(i) *If , there are only states with dominant observations.* One type of HMM is the basic transient, cyclic model with dominant and privileged observation values. Without loss of generality, we assume , , and are dominant and privileged observation values for states , , and and that the transition cycle is . So for the emission probabilities, or and for the state transition probabilities, Note that if , we have . For cyclic indices, operations are always followed by a modulo operation. We assume that the probabilities (i.e., transition and observation probabilities) can be split into two groups: large and small probabilities. There is a large deviation between the two groups; that is, large probabilities are much higher than small probabilities. For instance, large and small probabilities are around 0.9 and 0.1, respectively. Thus for the case addressed previously, and are large probabilities, which we denote as and , respectively, for simplicity. Similarly, and are denoted as and , respectively. Thus and . Moreover, we assume that and with , we have With (A.1)–(A.4), the following holds: The approximation in (A.5) holds because the terms of the sum for and are small probability factors. Since is two orders lower, it follows that .

If we assume and train the observations with a first-order Markov model, then we have which can be approximated by a first-order HMM as shown in Figure 13(a); thus If we assume and train the observations with a second-order Markov model, the following holds: which can be approximated by a second-order HMM as shown in Figure 13(b); thus The first-order HMM assumption counts the emission probability twice; that is, a larger probability factor of is calculated in (A.8) than in (A.10). Thus the division and the calculated probability with a second-order HMM assumption in (A.10) is higher than that with a first-order HMM assumption in (A.8).

(ii) *If , there are states with dominant observations and an extra mixing state.* Another type of HMM is the basic transient, cyclic model with mostly dominant and privileged observation values, but also with mixing observations. For demonstration purposes, we assume there exists one mixing state in the model, which emits observations and with equal probability , where . We call a medium probability because it is close to or equal to a probability of 0.5. Thus the first-order Markov model assumption holds: