Research Article  Open Access
Tingting Liu, Jan Lemeire, "Efficient and Effective Learning of HMMs Based on Identification of Hidden States", Mathematical Problems in Engineering, vol. 2017, Article ID 7318940, 26 pages, 2017. https://doi.org/10.1155/2017/7318940
Efficient and Effective Learning of HMMs Based on Identification of Hidden States
Abstract
The predominant learning algorithms for Hidden Markov Models (HMMs) are local search heuristics, of which the Baum-Welch (BW) algorithm is the most widely used. It is an iterative learning procedure starting with a predefined size of the state space and randomly chosen initial parameters. However, poorly chosen initial parameters carry the risk of falling into a local optimum and of slow convergence. To overcome these drawbacks, we propose a more suitable model initialization approach, a Segmentation-Clustering and Transient analysis (SCT) framework, to estimate the number of states and the model parameters directly from the input data. Based on an analysis of the information flow through HMMs, we demystify the structure of models and show that high-impact states are directly identifiable from the properties of observation sequences. States having a high impact on the log-likelihood make HMMs highly specific. Experimental results show that, even though the identification accuracy drops to 87.9% when random models are considered, the SCT method is around 50 to 260 times faster than the BW algorithm with 100% correct identification for highly specific models whose specificity is greater than 0.06.
1. Introduction
Hidden Markov Models (HMMs) [1] are one of the statistical modelling tools showing great success and have been widely used in diverse application fields such as speech processing [2], machine maintenance [3], acoustics [4], biosciences [5], handwriting and text recognition [6], and image processing [7]. Despite the merit of simplicity and learning capabilities, HMMs are still facing open problems such as learning effectiveness and efficiency.
There are two major problems in HMM learning: (1) choosing the model size (number of hidden states); (2) estimating the model parameters. Regarding the first problem, state-of-the-art approaches normally train multiple HMMs with different numbers of states and select the best one using specific criteria (e.g., the Akaike information criterion (AIC) [8] or the Bayesian Information Criterion (BIC) [9]). To tackle the second problem, traditional learning algorithms such as the Baum-Welch (BW) algorithm iteratively optimize the model parameters starting from an initial set of parameters, most often randomly chosen. Such iterative optimization heuristics are prone to local optima. Therefore, multiple runs (typically, 10 [10, 11] or 20 [12, 13]) with different initializations are performed and the best result is chosen. However, such iterative approaches with multiple trainings are slow and computationally expensive. Hsu et al. [14] introduced a non-iterative method employing a spectral-based algorithm for learning HMMs. It is simple and employs only a singular value decomposition and matrix multiplications. Nonetheless, it is evaluated in [15] and shown to be applicable only to identifying systems for which relatively few observations are available, and to fail completely for systems for which the number of available observations is large. Fox et al. [8] proposed a sticky HDP-HMM, which is a nonparametric, infinite-state model that automatically and robustly learns the size of the state space and the smoothly varying dynamics. However, this approach is computationally prohibitive when datasets are very large [9]. Therefore, in spite of their limitations, classical iterative approaches are still widely used to estimate model size and model parameters, for lack of alternatives.
The aim of this paper is to improve the effectiveness and efficiency of model learning compared to the conventional BW algorithm, in the sense of accurately and quickly finding the correct model. One of the HMM assumptions is that the observed data depends only on the hidden states given the model. Therefore, the observed data often reflects the structure and statistical properties of the model, which motivates us to introduce a data-driven pre-estimation procedure to estimate the number of states and choose proper initial model parameters.
We first provide insight into the essential features of an HMM that help to improve the model’s expressiveness as a stochastic process [16]. This is done by inspecting the role of each hidden state in generating observation distributions as well as in providing information on the model structure. Hidden states with a large influence on observation sequences increase the value of a model more than those with little or no influence. By analysing how the information flows through the HMMs, we determine which cases make a state have a high impact. As discussed in Section 3, persistent and/or transient-cyclic states appear to be high-impact states. Moreover, a model with high-impact states is highly specific and will be easy to identify. We introduce the term specificity as the minimum model distance between a model and the best of the HMMs with one state less. In contrast, some HMMs are in principle unidentifiable, which has been proved in [17] by linking the learning of HMMs to the nonlearnability results of finite automata. Furthermore, there are models in between the learnable and the unlearnable HMMs, which are hard to learn from observation sequences. Such HMMs contain complex parameter configurations with low specificity and low-impact states. Overall, experimental results show that the better number of states and the proper initialization learned by the proposed method increase the learning speed and accuracy for highly specific HMMs compared to the traditional Baum-Welch algorithm.
The remainder of the paper is organized as follows: in Section 2, the preliminaries about HMMs and the Baum-Welch learning problems are briefly reviewed, followed by the concepts and definitions of model characteristics such as model identifiability, model equivalence, and the minimality of models. In Section 3, the impact of states on model specificity is studied through the information analysis. The approximate identification framework is presented in Section 4, and experiments and results are discussed in Section 5. Finally, conclusions are given in Section 6.
2. Preliminaries
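An HMM [1] is a doubly stochastic process in which the underlying process is a Markov chain that is unobservable (hidden) and can only be observed through another stochastic process that emits the sequence of observations. Let $N$ denote the number of states and $M$ the number of observation symbols. Let $S = \{S_1, \dots, S_N\}$ and $V = \{v_1, \dots, v_M\}$ denote the set of states and the set of observations, respectively. Using $q_t$ and $o_t$ to represent the state and the emitted observation at time $t$, respectively, the state and observation sequences are denoted by vectors $Q = (q_1, \dots, q_T)$ and $O = (o_1, \dots, o_T)$, where $q_t \in S$, $o_t \in V$, and $T$ is the number of states or observations in the sequence. A discrete time HMM can be characterized by the quintuple $\lambda = (S, V, \pi, A, B)$ [1]: the initial state probability distribution is a column vector $\pi$, where the $i$th element is
$$\pi_i = P(q_1 = S_i), \quad 1 \le i \le N, \tag{1}$$
the state transition probability distribution matrix is $A$, where the $(i, j)$th element is
$$a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i), \quad 1 \le i, j \le N, \tag{2}$$
and the observation probability distribution matrix is $B$, where the $(i, k)$th element is
$$b_{ik} = P(o_t = v_k \mid q_t = S_i), \quad 1 \le i \le N, \; 1 \le k \le M. \tag{3}$$
Note that the state transition probabilities of state $S_i$ include both incoming and outgoing probabilities: the incoming state transition probabilities of $S_i$ are the $i$th column vector of $A$, denoted as
$$a^{\mathrm{in}}_i = (a_{1i}, a_{2i}, \dots, a_{Ni})^{\top}, \tag{4}$$
and the outgoing state transition probabilities of $S_i$ are the $i$th row vector of $A$, denoted as
$$a^{\mathrm{out}}_i = (a_{i1}, a_{i2}, \dots, a_{iN}), \tag{5}$$
where $a^{\mathrm{in}}_i, a^{\mathrm{out}}_i \in \mathbb{R}_{\ge 0}^{N}$ and $\mathbb{R}_{\ge 0}$ represents the set of nonnegative real numbers.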
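As a concrete illustration of these definitions, the following Python sketch stores a small discrete HMM as plain lists and extracts the incoming (column) and outgoing (row) transition probabilities of a state. The two-state, three-symbol parameter values and the helper names `incoming`, `outgoing`, and `is_row_stochastic` are hypothetical and not taken from the paper.

```python
from typing import List

Matrix = List[List[float]]


def incoming(A: Matrix, i: int) -> List[float]:
    """Incoming transition probabilities of state i: the i-th column of A."""
    return [row[i] for row in A]


def outgoing(A: Matrix, i: int) -> List[float]:
    """Outgoing transition probabilities of state i: the i-th row of A."""
    return list(A[i])


def is_row_stochastic(M: Matrix, tol: float = 1e-9) -> bool:
    """Check that every row is nonnegative and sums to one."""
    return all(min(row) >= 0.0 and abs(sum(row) - 1.0) <= tol for row in M)


# Hypothetical example model (values chosen only for illustration).
pi = [0.5, 0.5]              # initial state distribution
A = [[0.9, 0.1],             # A[i][j]: probability of moving from state i to j
     [0.2, 0.8]]
B = [[0.7, 0.2, 0.1],        # B[i][k]: probability that state i emits symbol k
     [0.1, 0.3, 0.6]]

print(incoming(A, 0), outgoing(A, 0))
```

Note that rows of the transition matrix sum to one while columns need not, which is why incoming and outgoing probabilities are treated separately.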
2.1. The Baum-Welch Learning Algorithm
One of the three basic problems for HMMs is the learning problem [1], which is often solved by an Expectation-Maximization (EM) algorithm [18], named the Baum-Welch algorithm [19, 20]. Starting with a random initial guess of the model, the model parameters are iteratively reestimated as long as the new model has an increased likelihood compared to the previous one; that is, $P(O \mid \bar{\lambda}) > P(O \mid \lambda)$, where $P(O \mid \lambda)$ and $P(O \mid \bar{\lambda})$ represent the likelihood values of an observation sequence $O$ under the previous model $\lambda$ and the newly updated model $\bar{\lambda}$, respectively. This procedure continues until the likelihood converges to a stationary point. However, the BW algorithm suffers from the problem of getting stuck at a local optimum if the initial model parameters are not well chosen, which motivates this study to search for a better estimate of the initial parameters.
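For the analysis, we need to calculate the likelihood of observations given the model, that is, $P(O \mid \lambda)$. It can be written by the use of the projection operations; see, for instance, [16, p. 18]. Let $\alpha_t$ denote the row vector with elements $\alpha_t(i) = P(o_1, \dots, o_t, q_t = S_i \mid \lambda)$, where
$$\alpha_1 = \pi^{\top} B_{o_1}, \qquad \alpha_t = \alpha_{t-1} A B_{o_t},$$
thus
$$\alpha_T = \pi^{\top} B_{o_1} A B_{o_2} \cdots A B_{o_T},$$
where, for $o_t = v_k$, $B_{o_t} = \mathrm{diag}(b_{1k}, \dots, b_{Nk})$ denotes the diagonal matrix of which the diagonal elements are the $k$th column of $B$.
Therefore, the likelihood of the observations given the model can be expressed as
$$P(O \mid \lambda) = \pi^{\top} B_{o_1} A B_{o_2} \cdots A B_{o_T} \mathbf{1},$$
where $\mathbf{1}$ is a column vector of length $N$ with all entries equal to 1; that is, $\mathbf{1} = (1, \dots, 1)^{\top}$. For the convenience of calculations, the logarithm of the likelihood, the log-likelihood (LL), is often used rather than the likelihood. Moreover, in this paper, we use the unit log-likelihood, an averaged LL, to present the LL per single observation, that is, $\frac{1}{T} \log P(O \mid \lambda)$, where $T$ is the number of observations. Within this paper, the term log-likelihood is used to represent the unit log-likelihood for simplicity.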
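The unit log-likelihood can be computed with the standard scaled forward recursion, which avoids numerical underflow on long sequences. The sketch below is one common way to do this and is not claimed to be the authors' implementation; the function name `unit_log_likelihood` and the list-based layout (`pi[i]`, `A[i][j]`, `B[i][k]`, observations as integer symbol indices) are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def unit_log_likelihood(pi: Sequence[float],
                        A: Sequence[Sequence[float]],
                        B: Sequence[Sequence[float]],
                        obs: Sequence[int]) -> float:
    """Average log-likelihood per observation via the scaled forward pass."""
    if not obs:
        raise ValueError("obs must contain at least one observation")
    n = len(pi)
    log_lik = 0.0
    alpha: List[float] = []
    for t, o in enumerate(obs):
        if t == 0:
            # Initialisation: alpha_1(i) = pi_i * b_i(o_1)
            alpha = [pi[i] * B[i][o] for i in range(n)]
        else:
            # Recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                     for j in range(n)]
        scale = sum(alpha)          # equals P(o_t | o_1, ..., o_{t-1})
        if scale == 0.0:
            return float("-inf")    # the sequence is impossible under the model
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik / len(obs)
```

Summing the logarithms of the per-step scale factors gives the total log-likelihood, so dividing by the sequence length yields the per-observation value used throughout the paper.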
2.2. Definitions of Model Characteristics
In this paper, we determine the learnability of HMMs through model identifiability. If two models are equivalent, the true model cannot be uniquely identified. Hence we firstly introduce the definition for model equivalence. Note that the HMM learning can be considered as a probability distribution specific problem, where every HMM has to be identified from the observations generated according to its own likelihood distribution. Therefore, the equivalence of HMMs can be defined based on their observation likelihood distributions as follows.
Definition 1 (HMM equivalence). Two HMMs $\lambda$ and $\lambda'$ are equivalent if and only if both models have the same observation emission probabilities (i.e., likelihood distribution over time series) for every observation sequence; alternatively, $P(O \mid \lambda) = P(O \mid \lambda')$ for every observation sequence $O$.
Note that the observation probabilities remain the same when the states of a model are permuted, since the states can be arbitrarily labeled. The model with permuted states is called a trivial equivalent model of the original model, as defined in [21]. We consider trivial equivalent models as the same model. In order to compare the models in later sections, we need to label states in a unique way such that corresponding states receive the same label. Therefore we define a process to normalize HMMs as follows.
Definition 2 (HMM normalization). For each state , a score is calculated by . Based on the score, we sort the states in ascending order.
Additionally, we can always construct an equivalent HMM with additional state numbers [22]; hence, in this paper, we consider HMM identifiability only when it is minimal, as defined below.
Definition 3 (HMM minimality). An HMM $\lambda$ with $N$ states is minimal if and only if it has no more states than any other equivalent model $\lambda'$ with $N'$ states; that is, $N \le N'$. Model $\lambda'$ is called a simpler model of $\lambda$ if they are equivalent and $N' < N$.
Definition 4 (HMM identifiability). An HMM is identifiable if and only if it is minimal and there does not exist any nontrivially equivalent model with an equal number of states; that is, .
Moreover, in this study we only address the identification of stationary (or homogeneous) HMMs where the prior probabilities can be eliminated in calculations. The initial state prior probability distribution has an influence on learning only at the beginning of an observation sequence and its impact on large sequences vanishes over time and thus can be excluded for learning HMMs in practice. A stationary HMM is defined as follows.
Definition 5 (stationary HMM). An HMM is stationary if its state distribution remains the same at every time instant; that is, $P(q_t = S_i) = \pi_i$ for all $t$, where $\pi$ equals the equilibrium state distribution; that is, $\pi^{\top} A = \pi^{\top}$ [23, p. 4902].
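For a given transition matrix, the equilibrium state distribution can be approximated by repeatedly applying the transition matrix to a starting distribution. The sketch below uses plain power iteration and assumes the chain is ergodic (irreducible and aperiodic), which the definition above does not state explicitly; the function name `stationary_distribution` is hypothetical.

```python
from typing import List, Sequence


def stationary_distribution(A: Sequence[Sequence[float]],
                            max_iters: int = 10_000,
                            tol: float = 1e-12) -> List[float]:
    """Approximate the distribution pi satisfying pi A = pi by power iteration."""
    n = len(A)
    pi = [1.0 / n] * n                       # start from the uniform distribution
    for _ in range(max_iters):
        nxt = [sum(pi[i] * A[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi                                # best estimate if not converged


# Hypothetical two-state chain: the exact answer is (2/3, 1/3).
print(stationary_distribution([[0.9, 0.1], [0.2, 0.8]]))
```

Using the result as the initial state distribution makes the model stationary in the sense of Definition 5.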
The element is a column vector with , and . The element represents the probability of going from state to state while emitting the observation by state , that is, .
Our proposed learning approach is based on the properties of observation sequences that make a state have a large impact on the model. To describe the degree of influence that a state can have on a model, we define a new term called specificity as the distance between a model and the best model with one state less. By best, we mean that, among all the one-state-fewer models, it best matches the observations generated by the original model, which also means that it has the minimum model distance to the original model. A general definition of model distance is as follows.
Definition 6 (HMM model distance). A model distance between two HMMs and is the difference of the unit loglikelihood of an observation sequence [1, p. 271]:where refers to the expectation operator, is an observation sequence generated by model , and is the size of the sequence. Equation (11) is basically a measure of how well model matches observations generated by model , in comparison with how well model matches observations generated by itself [1]. The specificity of a model can be then defined as follows.
Definition 7 (HMM specificity). The specificity of an HMM with states iswhere represents the set of all HMMs with states and is the length of an observation sequence generated by . We denote the optimal model with the minimum distance to model in (12) as .
We have to note that, to use Definitions 6 and 7 in practice, we will calculate the expectation with a single generated observation sequence. We assume that this sequence is long enough such that it is a typical sequence and gives a stable value which comes close to the expected value and as such is independent of the exact sequence, as is done by Rabiner [1].
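The single-sequence approximation described above can be sketched as follows: sample one long sequence from the reference model, then compare the unit log-likelihoods of that sequence under both models. This is an illustration only. The sign convention (other model minus reference model, as in Rabiner's formulation, so values are typically negative), the sequence length, the seed, and all function names are assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple

Model = Tuple[Sequence[float], Sequence[Sequence[float]], Sequence[Sequence[float]]]


def sample(model: Model, length: int, rng: random.Random) -> List[int]:
    """Draw an observation sequence from model = (pi, A, B)."""
    pi, A, B = model
    draw = lambda p: rng.choices(range(len(p)), weights=p)[0]
    q, obs = draw(pi), []
    for _ in range(length):
        obs.append(draw(B[q]))   # emit a symbol from the current state
        q = draw(A[q])           # move to the next state
    return obs


def unit_ll(model: Model, obs: Sequence[int]) -> float:
    """Unit log-likelihood via the scaled forward recursion."""
    pi, A, B = model
    n, total, alpha = len(pi), 0.0, list(pi)
    for t, o in enumerate(obs):
        if t > 0:
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) for j in range(n)]
        alpha = [alpha[j] * B[j][o] for j in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        total += math.log(scale)
        alpha = [a / scale for a in alpha]
    return total / len(obs)


def model_distance(ref: Model, other: Model, length: int = 5000, seed: int = 0) -> float:
    """Estimate the distance from one sequence generated by the reference model."""
    obs = sample(ref, length, random.Random(seed))
    return unit_ll(other, obs) - unit_ll(ref, obs)


# Hypothetical models: a persistent two-state HMM and a one-state approximation.
ref = ([0.5, 0.5], [[0.95, 0.05], [0.05, 0.95]], [[0.9, 0.1], [0.1, 0.9]])
one_state = ([1.0], [[1.0]], [[0.5, 0.5]])
print(model_distance(ref, one_state))
```

A distance far from zero indicates that the simpler model cannot reproduce the reference model's observations, which is the intuition behind specificity.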
To use the above definitions on a limited set of observation sequences, we have to rely on an approximate equivalence approach. In order to compare the HMMs according to the likelihood probability given a set of observation sequences , we have to define a threshold on the model distance to decide whether two HMM models are equivalent or not.
Definition 8 (distance threshold of equivalent HMMs). The distance threshold is defined as where is the asymptotic distribution of log-likelihood with , the element represents randomly generated sequences by model , is the length of an observation sequence, and is the total number of observation sequences [24]. Duan et al. [24] prove that the distribution of the log-likelihood can be approximated by a normal distribution $\mathcal{N}(\mu, \sigma^2)$. According to the “three-sigma” rule, the interval $[\mu - 3\sigma, \mu + 3\sigma]$ contains about 99.7% of the whole distribution. Thus a sequence has a certainty of about 99.7% of being generated by the model if its log-likelihood lies within this interval. As defined in Definition 1, two models are equivalent if and only if both models have the same likelihood distribution on observations. Hence, for any sequence generated by one model, if its log-likelihood under the other model lies within the interval, we can say that the two models are approximately equal. Therefore, the model distance threshold of equivalence is approximated as $3\sigma$ of the reference model for practical use.
As defined in Definition 3, a model is minimal if and only if it has no more states than any other equivalent model. In order to check model minimality in practice, we verify that there exists no one-state-simpler model which is equivalent to the model; in particular, we verify whether the minimum distance between the model and its best one-state-simpler model (i.e., the specificity of the model; see Definition 7) is outside the threshold of equivalent models defined in Definition 8. Therefore, the practical condition to check model minimality is defined as follows: a model can be approximately taken as minimal if the absolute value of its specificity is outside the distance threshold of $3\sigma$.
3. Impact of States on Observation Likelihood
We start the study with an information flow analysis to see the impact of different types of states on model specificity.
3.1. Information Flow Analysis
Our aim is to understand which parameters make an HMM have a higher specificity. However, an analytical equation for the specificity function requires us to know the optimal one-state-simpler model, which is still an open problem. This leads us to an alternative approach: analysing the state properties of models. In the following analysis, we will study which properties make up a high-impact state and which do not. A high-impact state makes itself more specific with a significant influence on the likelihood; thus it emits relatively unique patterns of observation sequences which can be distinguished from those of other states. Using this analysis, we propose in this paper a framework to identify the high-impact states.
To study what influences the specificity of an HMM, we analyse the impact of a state on the likelihood and how it contributes to as follows. Consider in (10). It can be seen as a probability used in predicting the future from the past and it represents the information flow from the past to the future. Hence we will analyse the contribution of a specific state to this probability. There are three cases whereby the probability of the state plays a role in the information flow, as shown in Figure 1:(a)The present state probability depends on the previous state probability and partly determines the observation probability .(b)The present state probability depends on the observations and determines the succeeding state probability. The observation probability depends on which is updated with the knowledge of .(c)The present state probability is determined by the past state probability and affects the future state probability.
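Figure 1: The three cases, (a), (b), and (c), in which the probability of a state plays a role in the information flow.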
3.2. High-Impact States
We now investigate the high-impact states on the likelihood, more specifically on the specificity. Such states should have a high and unique impact on the likelihood, where high means a high information flow passing from the past to the future states and unique ensures that no other states can fill the same role, such that the state cannot be mimicked by other states either with combined similar probabilities or by emitting similar observation probabilities. For instance, a state with a probability of 0.5 can be mimicked by a combination of two states with probabilities of 0.1 and 0.9, respectively; likewise, a state with observation emission probabilities of 0.5 is not unique. Note that a relatively high or low probability is more difficult to mimic than 0.5 in the previous examples. Hence, for the three cases outlined in Figure 1, in which the state plays an intermediate role in predicting the future based on the past, we can define the following conditions for a high-impact state, respectively:
(a) (1) The incoming transition probabilities (see (4)) of the state are maximal or minimal; (2) the state has a dominant observation, meaning that the observation probability (see (3)) is maximal.
(b) (1) The outgoing transition probabilities (see (5)) of the state are maximal or minimal; (2) the state has a dominant observation; refer to condition a(2).
(c) Refer to conditions a(1) and b(1).
For high specificity, the above conditions should be met for all states of a model. Note that these conditions are based on state transition and observation probabilities. Regarding transition probabilities, a highly specific HMM should contain persistent and/or transient-cyclic states, as defined below:
(i) A persistent state is a state with a higher self-transition probability than the probabilities to transit to other states. When all states of an HMM are persistent, the HMM remains for a certain period in one state before changing into another state. Such an HMM is called a persistent HMM.
(ii) A transient state, on the other hand, has a lower self-transition probability. A transient-cyclic state has one specific incoming transition probability which is high and dominant and one outgoing transition probability which is high. When all states of an HMM are transient-cyclic, the HMM flips from one state to another, mostly following a certain pattern. Such an HMM is called a transient-cyclic HMM. Otherwise, it is called a transient-acyclic HMM.
(iii) When an HMM contains both persistent and transient-cyclic states, we call it a hybrid HMM.
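The state types defined above can be read off a transition matrix directly. The following sketch labels each state as persistent, transient-cyclic, or neither. The source gives no numeric cut-off for "high and dominant", so the `dominant` threshold of 0.8 and the function name `classify_states` are assumptions made purely for illustration.

```python
from typing import List, Sequence


def classify_states(A: Sequence[Sequence[float]], dominant: float = 0.8) -> List[str]:
    """Label each state from the transition matrix A (A[i][j]: from i to j)."""
    n = len(A)
    labels = []
    for i in range(n):
        self_p = A[i][i]
        out_others = [A[i][j] for j in range(n) if j != i]   # row i without self
        in_others = [A[j][i] for j in range(n) if j != i]    # column i without self
        if self_p > max(out_others, default=0.0):
            # Self-transition beats every transition to another state.
            labels.append("persistent")
        elif (max(in_others, default=0.0) >= dominant
              and max(out_others, default=0.0) >= dominant):
            # One dominant way in and one dominant way out (assumed threshold).
            labels.append("transient-cyclic")
        else:
            labels.append("other transient")
    return labels


# Hypothetical examples: a persistent model and a three-state cycle 0 -> 1 -> 2 -> 0.
print(classify_states([[0.9, 0.1], [0.2, 0.8]]))
print(classify_states([[0.05, 0.9, 0.05], [0.05, 0.05, 0.9], [0.9, 0.05, 0.05]]))
```

A model whose states receive a mix of the first two labels corresponds to the hybrid case.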
Secondly, regarding observation probabilities, a highly specific HMM should contain privileged states, which are defined as follows: a privileged state is a state with at least one dominant observation probability.
HMMs containing only privileged states are called privileged HMMs. This is possible when the number of observations is larger than the number of states; that is, $M > N$.
Considering both transition and observation probabilities, we define a highly specific HMM as an HMM containing only persistent states and/or transient-cyclic states, which will be shown to be identifiable from observation sequences. Note that it is impossible to identify all minimal HMMs, especially when the influence of some states on a model is low, in the sense that such states can be neglected and the resultant simpler model is comparable to the more complex one. In order to learn a minimal identifiable HMM, we propose in a later section an effective and efficient model approximation method which identifies persistent states with segmentation and clustering methods and transient-cyclic states with a transient analysis based on the following theorem.
Theorem 9. The presence of transientcyclic states with dominant observations can be identified as follows: for values of , if and , where , represents the relative frequency (i.e., the ratio of the number of times) of event occurring in the observed sequence, which is also the predicted probability of the occurrence of event ; then for(a)if , that is, , , the triple does not reveal hidden transientcyclic states and thus it can be modelled by a 1order Markov model,(b)if , the triple reveals that hidden transientcyclic states are present:(i)If , the triple reveals states with dominant observations.(ii)If , the triple reveals states with dominant observations and an extra mixing state.
The proof is in Appendix A.
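The relative frequencies that such a transient analysis relies on can be counted directly from an observation sequence. The sketch below compares, for every observed triple, the relative frequency of the third symbol given the two preceding symbols with the frequency given only the immediately preceding symbol. When the two agree, a first-order Markov model on the observations suffices for that triple; a clear gap hints at hidden structure. This is only an illustration of the statistics involved, not the authors' test, and the function name `triple_statistics` is hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def triple_statistics(obs: Sequence[int]) -> Dict[Tuple[int, int, int], Tuple[float, float]]:
    """Map each triple (o1, o2, o3) to (freq(o3 | o1, o2), freq(o3 | o2))."""
    ctx1 = Counter(obs[:-1])                       # symbols that have a successor
    pairs = Counter(zip(obs, obs[1:]))             # (o2, o3) counts
    ctx2 = Counter(zip(obs[:-2], obs[1:-1]))       # (o1, o2) with a successor
    triples = Counter(zip(obs, obs[1:], obs[2:]))  # (o1, o2, o3) counts
    stats = {}
    for (o1, o2, o3), count in triples.items():
        second_order = count / ctx2[(o1, o2)]
        first_order = pairs[(o2, o3)] / ctx1[o2]
        stats[(o1, o2, o3)] = (second_order, first_order)
    return stats


# Hypothetical cyclic pattern 0, 0, 1 repeated: after (0, 0) the next symbol is
# always 1, although after a single 0 it is 1 only half of the time.
print(triple_statistics([0, 0, 1] * 50)[(0, 0, 1)])
```

In the example, the two zeros must come from different hidden states, which is exactly the kind of structure a first-order model on the observations cannot capture.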
The definitions of a Markov model and a mixing state used in the theorem are given as follows:(i)A Markov model is a stochastic process that is characterized by a Markov chain. It models the observed states with a random variable which satisfies the Markov property; that is, the distribution of the current state depends only on that of the previous state instead of the whole historical states. The state transition probability distribution and the initial state probability distribution are denoted by the same expressions as the HMM defined previously. The model can be written as .(ii)A mixing state is a state which outputs the same observation probabilities as a mixture of other states. HMM models containing mixing states are problematic, since one state has the same output distribution as a convex mixture of some other states’ output distribution; therefore it is difficult to distinguish the ground truth state between a single state and a mixture of several states [14].
3.3. Equivalent States
Now we try to understand when a state has zero impact on the specificity such that in the extreme case a simpler HMM exists with the same distributions. Considering the information flow , for the first arrow, the influence of a state is negligible when (1a) is close to zero; (1b) the state has an equal influence as another state if the probability equals that of another state; or (1c) the influence of the state can be mimicked by the other state if the probability is constant. Note that if it is neither constant nor the same as another state, the state probability will fluctuate which makes that its influence cannot be incorporated into that of other states. For the second arrow, the influence of the state can be incorporated into that of other states if (2a) is the same as the probabilities of another state or (2b) the probability distribution is not dominant.
In case (1a) the state plays no role and can be removed, in cases (1b) and (2a) the state can be merged with a similar state, and in cases (1c) and (2b) the influence of the state can be “taken over” by some of the remaining states. This leads to the conditions for eliminating redundant (i.e., equivalent) states as shown in Table 1. Note that the difference between “removal” and “taken over” is that, by removing a state, its information is removed together with the state, while “taking over” a state means that even though the state is deleted, its information stays and is passed to other states instead.

Based on the conditions for equivalent states defined in Table 1, we can now formalize the results of our analysis as sufficient conditions for the nonminimality of HMMs as follows.
Theorem 10. A stationary HMM is not minimal if one of the following conditions holds:(i)The HMM contains a state that has zero incoming state transition probabilities; that is, .(ii)The HMM contains two states and that have the same state transition probabilities; that is, and .(iii)The HMM contains two states and that have the same observation probabilities and meets one of the following conditions: (1) they have the same incoming state transition probabilities; that is, ; (2) they have the same outgoing state transition probabilities; that is, ; or (3) and .(iv)The HMM has two observation values and contains a state that has constant incoming state transition probabilities; that is, and for all , has nondominant observation probabilities; that is, .
The proof is in Appendix B.
3.4. Low-Impact States
Unlike high-impact or equivalent (zero-impact) states, some states have a larger-than-zero but very low impact, which makes them hard to learn. Such states are called low-impact states. HMMs containing these states are called hard-to-learn HMMs, as will be shown later.
Since low-impact states are in between high-impact and equivalent states, they meet a combination of partial conditions defined for both cases. As introduced in Section 3.2 for high-impact states, a learnable HMM should contain only persistent and/or transient-cyclic states with privileged observations, while an unlearnable HMM contains one or two states that satisfy the conditions defined in Theorem 10. Therefore, combined partial conditions of both can be defined for hard-to-learn HMMs.
An HMM is hard to learn if it contains mostly persistent or transientcyclic states with privileged states with dominant observations and is also under one of the following conditions:(i)There exists a mixing state whose observation distribution is a mixture of the observation distributions of two other states and ; that is, , where .(ii)There exists a state with constant incoming transitions, selfincluded; that is, , where .(iii)There exists a state with constant incoming transitions, selfexcluded; that is, , where .(iv)There exists a state with constant outgoing transitions, selfincluded; that is, , where .(v)There exists a state with constant outgoing transitions, selfexcluded; that is, , where .(vi)There exist two states and with the same observation probabilities , where .(vii)There exists a state with constant (nondominant) observation emissions; that is, , where .
4. Approximate Identification Algorithm
An HMM is either identifiable or unidentifiable. In order to describe how hard it is to identify a model, we use the term learnability: an identifiable HMM can be easy, moderate, or hard to learn. Thus, before presenting the approximate identification algorithm, we first explain our hypothesis on the correlation between model learnability and specificity as shown in Figure 2, which will be validated experimentally in Section 5. HMMs containing states with higher specificity have larger distances to less complex models and, as shown later, are easier to learn, and vice versa. Therefore, we classify HMMs into three identification categories based on their specificity: (1) learnable HMMs with relatively high specificity; (2) hard-to-learn HMMs with low specificity; and (3) unlearnable HMMs with almost zero specificity. Our focus is to identify learnable and highly specific models with high-impact states.
4.1. Algorithm Structure
Based on the previous analysis of the hidden states, we can construct an algorithm that identifies high-impact states directly from the observation sequences. We are inspired by signal processing methods such as Empirical Mode Decomposition (EMD) and wavelet-based denoising methods [25], which decompose the noisy signal into a number of components, filter each component, and finally reconstruct the denoised signal using the filtered components. Here we follow an analogous procedure: an unknown HMM is composed of a number of hidden states. These states can be identified from observations and combined to form a reconstructed model, as shown in Figure 3. In this manner, we decompose the model identification procedure into a combination of state identifications. The approximate state identification approach first recognizes persistent and transient states separately from observation sequences, then combines them into a set of identified states, and finally reduces or merges similar states into a new set of reconstructed states. The details of the identification framework are explained below.
Models with high-impact states generate specific samples which are unique. Therefore, the states are identifiable through data analysis. The output of a persistent state is likely to stay for a period within which the behavior at each time step is “similar,” which we call a regime. Thus a segmentation approach is used to split a signal sequence into regimes by identifying the specific behaviors within certain periods. In order to identify transient-cyclic states, our approach is to capture the changes of behavior with a transient analysis based on Theorem 9.
Therefore, we propose a framework to identify most high-impact and minimal states: (1) persistent privileged states; (2) transient-cyclic privileged states; and (3) hybrid: persistent and transient-cyclic privileged states. A schema of the proposed algorithm is shown in Figure 4. We assume that both persistent and transient states exist in the model; therefore, both the segmentation-and-clustering method and the transient analysis method are applied to the data, followed by a reparameterization procedure to combine the parameters learned by the two methods. Finally, a model reduction step is conducted to form a simplified minimal HMM. We name our proposed method the Segmentation-Clustering and Transient analysis (SCT) framework.
4.2. Segmentation-Based Approach
The segmentation-based approach is defined using the following steps: Step 1: signals are split by segmentation techniques into different regimes with different signal behaviors; Step 2: the “similar” regimes of signals are grouped together by clustering techniques according to their similarities (the clusters are labeled and each cluster is a hidden state); Step 3: a clustering validation index is employed to determine the proper number of states; finally, Step 4: HMM parameters are estimated by calculating statistical occurrences of the observations and the estimated hidden states.
Step 1 (identification of persistent states by segmentation). Data sequences emitted by persistent states can be segmented into subsequences with constant behavior (observations are drawn from a stationary distribution). The transition from one state to another can be identified by detecting a difference in signal behavior. This is called a change point. In this paper, we propose a sliding window-based Bayesian segmentation based on the test of [26]. The Bayesian probability is calculated to determine whether two sequences have been generated by the same or by a different multinomial model.
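Step 4 amounts to counting. Once every time step carries an estimated state label, transition and emission probabilities follow from relative frequencies. The sketch below is a minimal version of such counting; the function name `estimate_parameters`, the use of state occupancy as the initial distribution (reasonable only under the stationarity assumption), and the uniform fallback for unvisited states are all assumptions made here.

```python
from typing import List, Sequence, Tuple


def _normalize(rows: List[List[int]]) -> List[List[float]]:
    """Turn count rows into probability rows; unvisited rows become uniform."""
    out = []
    for row in rows:
        total = sum(row)
        out.append([c / total for c in row] if total else [1.0 / len(row)] * len(row))
    return out


def estimate_parameters(states: Sequence[int], obs: Sequence[int],
                        n_states: int, n_symbols: int
                        ) -> Tuple[List[float], List[List[float]], List[List[float]]]:
    """Estimate (pi, A, B) from aligned state labels and observations."""
    trans = [[0] * n_states for _ in range(n_states)]
    emit = [[0] * n_symbols for _ in range(n_states)]
    for q, o in zip(states, obs):
        emit[q][o] += 1                      # state q emitted symbol o
    for q, r in zip(states, states[1:]):
        trans[q][r] += 1                     # transition q -> r
    visits = [sum(row) for row in emit]
    pi = [v / len(states) for v in visits]   # occupancy as a stationary estimate
    return pi, _normalize(trans), _normalize(emit)


# Hypothetical labelled data: five time steps, two states, two symbols.
print(estimate_parameters([0, 0, 1, 1, 0], [0, 0, 1, 1, 0], 2, 2))
```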
A multinomial model is a stochastic process where the observations follow a multinomial distribution. It is sufficient to model observations instead of states, since the observations in a multinomial model can represent states correspondingly with absolute state knowledge. Each observed symbol at time is independent and falls into one of categories with a fixed probability, denoted by , where . The observation probability distribution is denoted by , where . The model can be represented with a compact notation .
The first sequence always starts from the last change point (the first point if at the beginning) and ends at the current time point; the second sequence is a fixedlength sliding window starting from the next time point. If the two successive sequences are very likely from different models, the point in between is marked as a change point. The procedure repeats until the end of the signal.
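The sliding-window procedure above can be sketched as follows. This is a minimal illustration, not the test of [26]: it assumes a symmetric Dirichlet prior on the multinomial parameters and compares the marginal likelihoods of "two models" versus "one model" through a log Bayes factor. The function names, the `window` and `threshold` defaults, and the refinement step that picks the strongest split inside the next window are all assumptions of this sketch.

```python
import math


def symbol_counts(seq, n_symbols):
    """Count how often each symbol 0..n_symbols-1 occurs in seq."""
    counts = [0] * n_symbols
    for s in seq:
        counts[s] += 1
    return counts


def log_marginal(counts, alpha=1.0):
    """Log marginal likelihood of an ordered symbol sequence with these
    counts under a multinomial model with a symmetric Dirichlet(alpha) prior."""
    k, n = len(counts), sum(counts)
    value = math.lgamma(k * alpha) - math.lgamma(k * alpha + n)
    for c in counts:
        value += math.lgamma(alpha + c) - math.lgamma(alpha)
    return value


def log_bayes_factor(first, second, n_symbols, alpha=1.0):
    """log P(data | different models) - log P(data | same model)."""
    ca = symbol_counts(first, n_symbols)
    cb = symbol_counts(second, n_symbols)
    merged = [a + b for a, b in zip(ca, cb)]
    return (log_marginal(ca, alpha) + log_marginal(cb, alpha)
            - log_marginal(merged, alpha))


def find_change_points(signal, n_symbols, window=20, threshold=3.0):
    """First sequence: last change point up to t. Second: fixed window from t."""
    change_points, start, t = [], 0, window
    while t + window <= len(signal):
        if log_bayes_factor(signal[start:t], signal[t:t + window],
                            n_symbols) > threshold:
            # Assumed refinement: the window may straddle the true boundary,
            # so pick the strongest split among the next `window` positions.
            last = min(t + window, len(signal) - window)
            t = max(range(t, last + 1),
                    key=lambda s: log_bayes_factor(
                        signal[start:s], signal[s:s + window], n_symbols))
            change_points.append(t)
            start, t = t, t + window
        else:
            t += 1
    return change_points
```

For a noiseless toy signal such as 100 zeros followed by 100 ones, this sketch marks a single change point at index 100. On noisy data the result depends strongly on the assumed window length and threshold.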
Step 2 (combination of states by clustering). With HMMs, segments corresponding to the same state will recur over time. Assuming that there is a finite number of states, segments with the same states are detected and clustered together. In this study, the classical k-means clustering approach [27, 28] is chosen to group and label segments. In our case, the k-means clustering algorithm tries to group the segments into unique states based on the mean value of data features within each cluster, given by . Because k-means clustering is sensitive to the random selection of initial parameters, we perform a preliminary step for selecting centroid starting locations. The selected properties in the segmentation step are a one-dimensional sequence, which contains subsequences with equal length. The median values of the subsequences are then used as initial centroid locations.
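A minimal sketch of this clustering step is given below, assuming each segment has already been summarized by a single scalar feature (for example, its mean). The helper names are hypothetical, and sorting the features before splitting them into equal-length subsequences is an assumption of this sketch, since the text does not state the ordering.

```python
from statistics import mean, median


def median_initial_centroids(features, k):
    """Split the (sorted) features into k near-equal chunks and use the
    median of each chunk as a starting centroid. Requires len(features) >= k."""
    ordered = sorted(features)  # sorting is an assumption of this sketch
    n = len(ordered)
    bounds = [i * n // k for i in range(k + 1)]
    return [median(ordered[bounds[i]:bounds[i + 1]]) for i in range(k)]


def kmeans_1d(features, k, max_iter=100):
    """Plain k-means on scalar features with the median-based initialization."""
    centroids = median_initial_centroids(features, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every segment feature.
        new_labels = [min(range(k), key=lambda c: abs(x - centroids[c]))
                      for x in features]
        if new_labels == labels:
            break  # labels are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [x for x, lab in zip(features, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean(members)
    return labels, centroids
```

Each resulting label plays the role of a hidden state, so the label sequence over segments can be expanded back to one state label per time step.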
Step 3 (cluster validity). In order to select the optimal number of clusters, we propose a constraint-based clustering analysis considering both the cluster separation capabilities of hidden states and the simplicity of HMM models. Constraint 1: a lower Davies-Bouldin index (DBI) [29] indicates that the clustering exhibits better intracluster grouping and intercluster separation of each state. Constraint 2: instead of selecting the minimum DBI, an allowance with a threshold of 0.05 is given so that a smaller number of states will be selected if its DBI is within the range of .
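Suppose dataset $X$ is partitioned into $K$ disjoint nonempty clusters and let $\{C_1, \ldots, C_K\}$ denote the obtained partitions, such that $C_i \cap C_j = \emptyset$ (empty set) for $i \neq j$, $C_i \neq \emptyset$, and $X = \bigcup_{i=1}^{K} C_i$. The Davies-Bouldin index [29] is defined as
$$\mathrm{DB} = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \frac{\Delta(C_i) + \Delta(C_j)}{\delta(C_i, C_j)},$$
where $\Delta(C_i)$ and $\delta(C_i, C_j)$ indicate the intracluster diameter and the intercluster distance, respectively. The partition with the minimum Davies-Bouldin index is considered as the optimal choice.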
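The two constraints of Step 3 can be illustrated with the sketch below. It uses the standard form of the Davies-Bouldin index on scalar features, with the mean absolute deviation from the centroid as the intracluster term and the centroid distance as the intercluster term; these concrete choices, the function names, and the reading of the allowance as "within 0.05 of the minimum DBI" are assumptions of this sketch.

```python
def davies_bouldin(points, labels):
    """Standard Davies-Bouldin index for scalar points (needs >= 2 clusters
    with distinct centroids)."""
    clusters = {}
    for x, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(x)
    centroid = {lab: sum(v) / len(v) for lab, v in clusters.items()}
    # Intracluster term: mean absolute deviation from the centroid.
    scatter = {lab: sum(abs(x - centroid[lab]) for x in v) / len(v)
               for lab, v in clusters.items()}
    ids = list(clusters)
    total = 0.0
    for i in ids:
        # Worst-case similarity of cluster i to any other cluster.
        total += max((scatter[i] + scatter[j]) / abs(centroid[i] - centroid[j])
                     for j in ids if j != i)
    return total / len(ids)


def select_num_states(dbi_by_k, tolerance=0.05):
    """Smallest number of states whose DBI is within `tolerance` of the minimum."""
    best = min(dbi_by_k.values())
    return min(k for k, value in dbi_by_k.items() if value <= best + tolerance)
```

For example, with DBI values `{2: 0.31, 3: 0.28, 4: 0.27, 5: 0.40}` the minimum is reached at four states, but two states are selected because 0.31 lies within the 0.05 allowance.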
Step 4 (parameter estimations). Parameters of an HMM (i.e., the prior, transition, and observation probability matrices) can be calculated by simply counting the occurrences of the observed signal and the hidden states (i.e., labels retrieved from clustering), which is the same calculation as the re-estimation step of the Baum-Welch algorithm [1].
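The counting estimates of Step 4 can be sketched as follows, assuming the clustering labels provide one state index per time step. The function name and the uniform fallback for states that never occur are assumptions of this sketch.

```python
def normalize_row(row):
    """Turn counts into probabilities; fall back to uniform if all are zero."""
    total = sum(row)
    if total == 0:
        return [1.0 / len(row)] * len(row)
    return [value / total for value in row]


def estimate_hmm_by_counting(state_seqs, obs_seqs, n_states, n_symbols):
    """Estimate (prior, transition, observation) matrices from labeled series."""
    prior = [0.0] * n_states
    trans = [[0.0] * n_states for _ in range(n_states)]
    emit = [[0.0] * n_symbols for _ in range(n_states)]
    for states, obs in zip(state_seqs, obs_seqs):
        prior[states[0]] += 1                       # first state of each series
        for s, o in zip(states, obs):
            emit[s][o] += 1                         # state-observation pairs
        for s, s_next in zip(states, states[1:]):
            trans[s][s_next] += 1                   # consecutive state pairs
    return (normalize_row(prior),
            [normalize_row(row) for row in trans],
            [normalize_row(row) for row in emit])
```

Because the state labels are treated as known, no forward-backward pass is needed; the estimates are simple relative frequencies.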
4.3. Transition-Based Approach
In this section we present a transition-based approach to identify transient-cyclic states. To estimate the observation matrix, we apply Theorem 9, which is dedicated to identifying transient-cyclic privileged states with or without mixing states. Firstly, the first-order transition probabilities can be identified under a Markov model assumption by counting the occurrences of the observation sequences: Similarly, a second-order transition probability can be modelled under an HMM assumption and calculated by counting: where . A threshold for dominant probabilities is calculated as If the two consecutive first-order probabilities are dominant, that is, and , where , then the division of the second-order transition probabilities calculated from a Markov chain and from an HMM assumption is If , there are no transient-cyclic states. Otherwise, the first-order transition probabilities are taken as the dominant observation probabilities and used to build the observation matrix. If , a mixing state is present and one extra state is added to the observation matrix with a uniformly distributed probability of . In the end, we map each observation value , , and to a different state because we look for states with at least one dominant observation value. If a state has multiple observation values, they will be merged into one state. See Table 1 for conditions of model state reduction.
Take a simple 2observation case as an example; we generate a 10series sequence with length of 1000. The firstorder observation transition probabilities are . If the calculated occurrence probabilities are , the dominant transitions larger than are and . Thus, the dominant secondorder probabilities are , equal to calculated by a Markov model, while being equal to calculated by an HMM. Thus, the division of the two is . Since both probabilities are smaller than 1, they are dominant states and there is no mixing state. Therefore, we map the two dominant probabilities to the observation probabilities of two states and the final observation matrix is .
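The counting behind the first-order and second-order observation transition probabilities can be sketched as below. The function names are hypothetical. The ratio helper reflects one possible reading only: under a first-order Markov chain the probability of the next symbol depends only on the current one, so the Markov prediction is compared with the directly counted second-order probability. The exact ratio used with Theorem 9 is not reproduced in this text and may be defined differently.

```python
from collections import Counter


def observation_transition_stats(sequences):
    """Return first-order P(b | a) and second-order P(c | a, b) by counting."""
    pair_counts, triple_counts = Counter(), Counter()
    for seq in sequences:
        pair_counts.update(zip(seq, seq[1:]))
        triple_counts.update(zip(seq, seq[1:], seq[2:]))
    # Totals per conditioning context.
    from_symbol, from_pair = Counter(), Counter()
    for (a, _b), n in pair_counts.items():
        from_symbol[a] += n
    for (a, b, _c), n in triple_counts.items():
        from_pair[(a, b)] += n
    first = {(a, b): n / from_symbol[a] for (a, b), n in pair_counts.items()}
    second = {(a, b, c): n / from_pair[(a, b)]
              for (a, b, c), n in triple_counts.items()}
    return first, second


def markov_to_counted_ratio(first, second, a, b, c):
    """Assumed reading: P(c | b) from the Markov chain divided by the
    counted P(c | a, b). A value near 1 means no extra memory is visible."""
    return first[(b, c)] / second[(a, b, c)]
```

A ratio far from 1 for a dominant pair of transitions suggests that the observations alone do not form a Markov chain, which is the kind of deviation the transient analysis looks for.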
Furthermore, for calculating the prior and transition probabilities, we stick to the assumption that privileged state behaviors can be reflected by observation properties; therefore, we assume the number of states is the same as the number of observations and use a Markov model for learning state probabilities.
The prior and transition matrices can be calculated by counting the observation occurrences. In this way, a model is learned containing only transient-cyclic states.
4.4. Reparameterization
Parameters learned for both persistent and transient-cyclic states are combined by a procedure called reparameterization. Let be the parameters for persistent states and be the parameters for the transient-cyclic states. Let and be the number of persistent and transient-cyclic states, respectively. Thus the combined number of states is . The parameters of the combined model can be calculated as where the function Normalize ensures that the entries of the given vector sum to 1 and the function Stochastic ensures that each row of the given matrix sums to 1.
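The two helper functions named above can be sketched as follows. Only Normalize and Stochastic are illustrated; the rule that combines the two parameter sets is not reproduced here, and the concatenation of the two prior vectors in the usage note is an assumption made purely for illustration.

```python
def normalize(vector):
    """Rescale a non-negative vector so that its entries sum to 1."""
    total = sum(vector)
    return [value / total for value in vector]


def stochastic(matrix):
    """Rescale every row of a non-negative matrix so that it sums to 1."""
    return [normalize(row) for row in matrix]
```

For instance, concatenating a persistent prior `[0.5, 0.5]` with a transient-cyclic prior `[1.0]` and calling `normalize` yields `[0.25, 0.25, 0.5]`, a valid prior over three states.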
4.5. Model Reduction
After combining the persistent and transient-cyclic states, redundant states may occur. We introduce a model reduction procedure which removes redundant states to obtain minimal HMMs according to the conditions defined in Theorem 10. We relax the strict conditions given in the theorem by adding thresholds.
(i) An HMM contains a state that has zero incoming state transition probabilities; that is, . Instead of using a zero vector as a strict rule, a threshold is defined to allow near-zero cases such that if the sum of the incoming state transition probabilities of state is lower than threshold , that is, the state can be removed.
(ii) An HMM contains two states and that have the same state transition probabilities; that is, and . We replace the equivalence condition by a subtraction calculation. If the maximum of the incoming and outgoing state transition probabilities of the two states and is below a threshold , that is, the two states can be merged into one.
(iii) An HMM contains two states and that have the same observation probabilities , and meet one of the following conditions: (1) they have the same incoming state transition probabilities; that is, ; (2) they have the same outgoing state transition probabilities; that is, ; or (3) and . Similar to (ii), we use subtraction instead of strict equivalence with added threshold . Moreover, the AND condition and the OR conditions can be represented by selecting the maximum and minimum values, respectively. Therefore, if the two states and can be merged.
(iv) An HMM has two observation values and contains a state that has constant incoming state transition probabilities, and , and has nondominant observation probabilities, . where is the average of and the state can be “taken over.”
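Conditions (i) and (ii) above can be sketched as simple threshold checks on the transition matrix. The function names and the default threshold values are hypothetical, and counting the self-transition as part of the incoming probabilities is an assumption of this sketch.

```python
def is_removable(trans, state, eps=0.01):
    """Condition (i): the summed incoming transition probability of `state`
    (column `state` of the row-stochastic matrix) is near zero."""
    incoming = sum(row[state] for row in trans)
    return incoming < eps


def are_mergeable(trans, i, j, eps=0.05):
    """Condition (ii): states i and j have (almost) the same outgoing rows
    and the same incoming columns, measured by the largest absolute gap."""
    n = len(trans)
    outgoing_gap = max(abs(trans[i][k] - trans[j][k]) for k in range(n))
    incoming_gap = max(abs(trans[k][i] - trans[k][j]) for k in range(n))
    return max(outgoing_gap, incoming_gap) < eps
```

Taking the maximum of the two gaps implements the AND of both requirements, in line with the remark in (iii) that AND and OR conditions map to maximum and minimum values.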
The selection of thresholds is conducted empirically since the correlation between the likelihood value and each condition is complex and is not our focus in this paper.
5. Experiments
Simulated data has been used to evaluate the effectiveness and efficiency of the proposed SCT inference framework. The simulated data were sampled from different classes of HMM models: nonminimal equivalent HMMs, identifiable (selected and random) minimal HMMs, and hard to learn HMMs.
5.1. Nonminimal Equivalent HMMs
Equivalent HMMs comprise two cases: (1) two HMMs with the same number of states, where permutations of states apply to both models; (2) two HMMs with different numbers of states. This experiment focuses on case (2) and aims to test the model reduction conditions defined in Table 1, where a nonminimal HMM can remove, merge, or take over its redundant states to become an equivalent minimal HMM. One model is selected under each of the three reduction conditions and is used to construct an equivalent model by removing the redundant state (set as the last state here). The model parameters are listed hereafter, respectively: With a removable state: HMM : HMM : With a mergeable state: HMM : HMM : With a taken-over state: HMM : HMM :
Each of the reference nonminimal models is used to generate 1000 datasets of random observations containing sequences of observation points. The datasets are used to determine loglikelihood distributions of the models and the distance threshold of equivalent models defined in Definition 8. By calculating the percentage of the loglikelihood values of the minimal model which fall inside the threshold of equivalence for the nonminimal model , we can obtain a confidence level. The results in Table 2 show that the two models are approximately equivalent for all three cases with high confidence levels. Moreover, the loglikelihood histograms are plotted in Figure 5. The highly overlapping histograms further demonstrate the model equivalence for the three examples.
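The confidence level described above can be sketched as the fraction of loglikelihood values of one model that fall inside an equivalence band around the other model's loglikelihood distribution. Definition 8 is not reproduced in this section, so the band used here (mean plus or minus three sample standard deviations) and the function name are assumptions of this sketch.

```python
from statistics import mean, stdev


def equivalence_confidence(reference_lls, candidate_lls, n_sigma=3.0):
    """Fraction of candidate loglikelihoods within n_sigma standard
    deviations of the mean of the reference loglikelihoods."""
    mu = mean(reference_lls)
    radius = n_sigma * stdev(reference_lls)
    inside = sum(1 for ll in candidate_lls if abs(ll - mu) <= radius)
    return inside / len(candidate_lls)
```

With 1000 datasets per model, `reference_lls` and `candidate_lls` would each hold 1000 values, one per dataset.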

5.2. Highly Specific HMMs
As discussed previously, persistent and transient-cyclic HMMs with privileged observations are identifiable HMMs that have a high specificity. In this section, we compare the learning of such identifiable HMMs with the Baum-Welch (BW) algorithm and the proposed SCT method.
Firstly, we constructed 9 persistent and 9 transient-cyclic models as ground truth models with a fixed equal number of states and observations () ranging from 2 to 10. These models can be expressed as follows: Persistent : Transient-cyclic :
The state self-transition probabilities for persistent models and the transition probabilities to the next neighboring state for transient-cyclic models were both set to a value close to 1, noted as and , respectively, where . The remaining transitions have equal probabilities; that is, . For all the 18 models, the observation matrices are set the same as the transition matrices of persistent models in order to obtain privileged observations. The initial parameters are uniformly distributed.
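The construction of the two kinds of transition matrices can be sketched as follows. The dominant probability `p` stands for the value close to 1 mentioned above (its exact value is not reproduced here), the function names are hypothetical, and wrapping from the last state back to the first in the transient-cyclic case is an assumption of this sketch.

```python
def persistent_matrix(n_states, p):
    """Self-transition probability p; the rest is shared equally."""
    rest = (1.0 - p) / (n_states - 1)
    return [[p if i == j else rest for j in range(n_states)]
            for i in range(n_states)]


def transient_cyclic_matrix(n_states, p):
    """Probability p of moving to the next state (cyclically); the rest
    is shared equally among the other states, including the state itself."""
    rest = (1.0 - p) / (n_states - 1)
    return [[p if j == (i + 1) % n_states else rest for j in range(n_states)]
            for i in range(n_states)]
```

Following the text, the observation matrix of either model type can be obtained by reusing `persistent_matrix`, so that each state has one privileged observation.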
We also generated 20 hybrid models as ground truth models containing both persistent and transient-cyclic states. The number of states and for both cases was randomly chosen from 2 to 3, amounting to a total number of states within a range of . The rest of the parameters were generated in the same manner as before. For simplicity, we use to represent a matrix containing only element . Thus, a hybrid model can be represented as follows: Hybrid :
The experiments were carried out by using each of the constructed models as a reference model to generate a dataset of 10 series of 1000 observations. The first seven series were used as a training set and the last three series as a test set. The true number of states was assumed to be unknown, and the learning methods had to select the number of states from a state pool of . For the BW learning algorithm, models were generated with a number of states ranging from 2 to and the model with the best was selected by the AIC criterion [30]. The learning of the BW algorithm was repeated 20 times to reduce the effect of local optima, and the run with the minimum AIC value was selected. In total, models were generated to determine an optimal model. Moreover, for comparison purposes, we also trained BW with a given number of states ; in that case, a total of 20 models were generated and the best model was selected by the AIC criterion. On the other hand, for the proposed SCT method, the number of states was selected by the clustering validation method in Step 3 (see Section 4.2). Only one SCT model was trained and compared with the two best models selected by the BW method (with and without a given ).
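The AIC-based selection among BW models can be sketched as follows, using the standard definition AIC = 2k - 2 ln L. The parameter count assumes a discrete HMM with free prior, transition, and observation probabilities; the function names and the layout of `candidates` are hypothetical, and the loglikelihood is assumed to be the total (not negated, not per-observation) natural-log likelihood on the training set.

```python
def n_free_parameters(n_states, n_symbols):
    """Free parameters of a discrete HMM: prior, transition, observation."""
    return ((n_states - 1)
            + n_states * (n_states - 1)
            + n_states * (n_symbols - 1))


def aic(log_likelihood, n_params):
    """Akaike information criterion; lower is better."""
    return 2 * n_params - 2 * log_likelihood


def select_by_aic(candidates, n_symbols):
    """`candidates` maps a number of states to the best training
    loglikelihood found for it; return the state count with minimum AIC."""
    return min(candidates,
               key=lambda n: aic(candidates[n], n_free_parameters(n, n_symbols)))
```

For example, with three observation symbols and loglikelihoods `{2: -1200.0, 3: -1190.0, 4: -1189.0}`, the AIC values are 2414, 2408, and 2424, so three states are selected.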
In order to use the 3-sigma rule to indicate if a true model is learned, 100 datasets of 10 series of 1000 observations were generated from each of the true models and used for calculating the loglikelihood distribution (see Definition 8). If the loglikelihood difference between the ground truth model and the learned model is outside the distance threshold of , we consider that the true model has not been found (i.e., a local optimum is learned). Moreover, to better interpret the loglikelihood results, we additionally calculated the loglikelihoods for the following models: (1) : the best model with states selected from 100 randomly generated models and trained with BW; (2) : the best model with states selected from 100 randomly generated models and trained with BW; (3) a multinomial model: the model assuming that there are no hidden states and the observations are the actual visible states. If an HMM has a loglikelihood similar to that of a multinomial model, the states have no impact on the model.
In addition to loglikelihoods, we define other performance indicators of accuracy as follows: (1) the percentage of convergence is the percentage of the 20 BW learned models which did not fall into a local optimum; (2) the percentage of identification is the percentage of the best BW or the SCT learned models which did not fall into a local optimum; (3) the parameter distance is defined as the mean difference of the triples () between two HMM models. If the two models have different state spaces, zeros are filled into the probability matrices of the simpler model so that it has the same number of states as the more complex one. Moreover, all the permutations of the models are considered and the minimum distance is chosen as the parameter distance.
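The parameter distance can be sketched as below, assuming "mean difference" means the mean absolute difference over all entries of the prior, transition, and observation matrices. The function names and the tuple layout `(prior, trans, emit)` are hypothetical. The search over all state permutations grows factorially, so this sketch is only practical for small state counts.

```python
from itertools import permutations


def pad_model(prior, trans, emit, n):
    """Zero-pad a model with fewer states up to n states."""
    k, m = len(prior), len(emit[0])
    prior_p = list(prior) + [0.0] * (n - k)
    trans_p = ([list(row) + [0.0] * (n - k) for row in trans]
               + [[0.0] * n for _ in range(n - k)])
    emit_p = [list(row) for row in emit] + [[0.0] * m for _ in range(n - k)]
    return prior_p, trans_p, emit_p


def parameter_distance(model_a, model_b):
    """Minimum over state permutations of the mean absolute difference."""
    n = max(len(model_a[0]), len(model_b[0]))
    pa, ta, ea = pad_model(*model_a, n)
    pb, tb, eb = pad_model(*model_b, n)
    m = len(ea[0])
    n_entries = n + n * n + n * m
    best = float("inf")
    for perm in permutations(range(n)):
        total = sum(abs(pa[i] - pb[perm[i]]) for i in range(n))
        total += sum(abs(ta[i][j] - tb[perm[i]][perm[j]])
                     for i in range(n) for j in range(n))
        total += sum(abs(ea[i][o] - eb[perm[i]][o])
                     for i in range(n) for o in range(m))
        best = min(best, total / n_entries)
    return best
```

Two models that differ only by a relabeling of their states therefore have distance zero, which is the reason for minimizing over permutations.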
Average properties of the ground truth models can be found in Table 3. The hybrid models are less specific than the models with only persistent or only transient-cyclic states, which makes them harder to identify. Detailed learning results can be found in Table 4. The results show that the proposed SCT method learns much faster than the traditional BW algorithm, with a speedup of around 180 to 260 times. The BW algorithm tends to overfit the model by using a larger number of states, resulting in a higher parameter distance. Even when the true number of states is given, the BW algorithm still fails to learn the persistent models correctly, as shown by a larger testset loglikelihood difference and lower convergence and identification rates.
 
Note: : mean of the loglikelihood distribution of the true model; : standard deviation of the loglikelihood distribution of the true model. 
 
Iters.: average number of iterations; Conv. : rate of convergence; Identi. : percentage of identification; : unit loglikelihood difference between the true models and the learned model on testsets; Para. Dist.: parameter distance; Pers.: Persistent; Tran.: Transientcyclic; Hybr.: Hybrid. Note that, for the BW, when calculating , , and Para. Dist., the learned model is the best one selected from the repeated random models. 
Learning results for a 10-state persistent model, a 10-state transient-cyclic model, and a 6-state hybrid model are used as examples for visualization. For the persistent model, the iterative learning process is shown in Figures 6(a) and 6(b). In Figure 6(a), the BW training was conducted with an unknown number of states , while in Figure 6(b), was given. Similarly, results for the transient-cyclic model are shown in Figures 7(a) and 7(b), and those for the hybrid model in Figures 8(a) and 8(b). The figures show that the proposed SCT method starts from a good initial model and converges much faster than most of the 20 randomly initialized BW models, which start from a level almost equivalent to that of the multinomial model, for which hidden states play no role. Moreover, although some of the best BW models converge in the end, their loglikelihood values are still not as good as those of the SCT method. We see that some of the 20 repeated models with states became stuck in a local optimum with a loglikelihood similar to that of model or .
Figure 6 (persistent model): (a) loglikelihoods during iterative training, BW with unknown number of states; (b) loglikelihoods during iterative training, BW given the correct number of states; (c) heat map of HMM model parameters.
Figure 7 (transient-cyclic model): (a) loglikelihoods during iterative training, BW with unknown number of states; (b) loglikelihoods during iterative training, BW given the correct number of states; (c) heat map of HMM model parameters.
Figure 8 (hybrid model): (a) loglikelihoods during iterative training, BW with unknown number of states; (b) loglikelihoods during iterative training, BW given the correct number of states; (c) heat map of HMM model parameters.
In order to compare the model parameters, heat maps of the original and inferred state transition and observation matrices are plotted in Figures 6(c), 7(c), and 8(c). A lighter color indicates a higher probability value close to one, while a darker color indicates a lower probability value close to zero. We notice that the BW method with an unknown learns a complex model with two more states than the true model in all three cases, which is overfitting, while the SCT approach learns the state size correctly. Moreover, the SCT method has a one-to-one correspondence of the high probabilities (in white/light yellow) between the transition and observation parameter matrices, meaning that the SCT-trained model is almost equivalent to the reference true model. However, for the BW method, especially when is unknown, there are no one-to-one relations in either the transition or the observation matrices, as can be seen from the colors of the heat maps that differ from those of the true model. This means that some of the probabilities are wrongly learned.
5.3. Hard to Learn HMMs
For each of the seven hard-to-learn conditions defined in Section 3.4, we construct five ground truth models, resulting in a total of 35 models. For each model, persistent and transient-cyclic state numbers are randomly generated from a range of . The privileged state probability is set randomly within a range of . The remaining probabilities are uniformly distributed. For conditions (ii), (iii), (iv), and (v), one extra state is generated according to the specified conditions. For condition (i), state 3 is defined as a mixing state of states 1 and 2. For condition (vi), the first two states are set to have the same observations. For condition (vii), state 2 is set to have constant observation emission probabilities. The rest of the experiment is set up in the same way as the previous experiments designed for identifiable models. Results show that only 31% of the 35 ground truth models are specific, with an average specificity of 0.04. A detailed comparison of the learning results is presented in Table 5.
 
See abbreviations and notes in Table 4 for more details. 
From the results in Table 5 we can see that the BW algorithm is slower than the proposed SCT method, requiring almost twice as many iterations to converge. The SCT method is around 230 times faster than the BW algorithm. A positive average delta indicates that the BW method mostly overfits the true models, with an average of 1.71 extra states, while the SCT has a negative delta, indicating slight underfitting with an average of 0.66 fewer states. Even though the testset loglikelihood difference of the SCT is higher than that of the BW method, the average parameter distance further shows that the BW algorithm tends to overfit the models in order to have a lower loglikelihood. Moreover, the percentage of convergence reveals that repeated runs (e.g., 20 times in this experiment) are still necessary for the BW method to learn effectively, even at the cost of a longer learning time. Lastly, the SCT has a slightly lower but comparable identification percentage, in return for a significant learning speedup.
To visualize the results, we select two models, one under condition (i), a state being a mixing state, and one under condition (v), a state with constant self-excluded outgoing transitions, as examples shown in Figures 9 and 10, respectively.
Figure 9 (condition (i)): (a) loglikelihoods during iterative training, BW with unknown number of states; (b) loglikelihoods during iterative training, BW given the correct number of states; (c) heat map of HMM model parameters.
Figure 10 (condition (v)): (a) loglikelihoods during iterative training, BW with unknown number of states; (b) loglikelihoods during iterative training, BW given the correct number of states; (c) heat map of HMM model parameters.
Figures 9 and 10 show that the BW algorithm with an unknown overfits the true models with two extra states, while the SCT method underfits the model with one state fewer: both the mixing state and the state with the same outgoing transitions in the two examples are merged into other states because they are not specific enough to be identified.
5.4. Random HMMs
In this experiment, we generated 10000 random HMM models configured with a combination of random and random . In order to guarantee that each HMM is minimal, we select models according to two criteria: (1) the model should have a higher testset loglikelihood than that of a multinomial model; (2) the model compared to the best state model should not satisfy the three-sigma rule for model equivalence criteria defined in Definition 8. A random HMM is discarded if it is not minimal. In the end, we obtain 149 specific minimal HMMs. The training procedure is conducted in the same way as in Section 5.2.
The experiment shows that the average specificity of the true models is 0.03, which is around 10 times less specific than the identifiable models used in the previous experiments in Section 5.2. Moreover, the mean of the loglikelihood distribution of the true model is 1.58, which is also much higher than that of the identifiable models. The above results indicate that random models are less specific and therefore less identifiable. A detailed comparison of the identification results is shown in Table 6.
 
See abbreviations and notes in Table 4 for more details. 
The results show that the SCT method needs on average one more iteration than the BW algorithm and that its identification results are less accurate because the models are not specific enough to be estimated correctly. However, the SCT method still shows a speedup of around 50 times vis-à-vis the Baum-Welch method. Both approaches overfit the models by more than one state on average.
Figure 11(a) shows the dependence between the true model specificity and the testset loglikelihood difference with the true models. When the specificity is too low, the SCT method identifies the models less accurately. Thus, the less specific the model is, the harder it becomes for the SCT method to learn. For comparison, Figure 11(b) shows the same plot for the highly specific models generated in Section 5.2. The results further confirm that, for highly specific models, when the specificity is relatively low, the SCT method outperforms the BW method. The loglikelihood differences of the models learned by Baum-Welch increase significantly, indicating that completely wrong models are learned.
The results are expected because the SCT method is designed for highly specific models rather than for random ones with lower specificity. To see how specificity influences whether the SCT method learns correctly, we plot the identification accuracy versus specificity thresholds ranging from −0.01 to 0.2 with a step of 0.01, as shown in Figure 12. Models are selected when their specificity is higher than the specificity threshold; the percentage of correctly identified models among the selected models is then used as the identification accuracy.
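The accuracy-versus-threshold curve described above can be computed as in the following sketch. The function name and the representation of each model as a `(specificity, identified)` pair are assumptions made for illustration.

```python
def accuracy_by_threshold(models, thresholds):
    """For each threshold, keep models whose specificity exceeds it and
    report the fraction of those that were correctly identified.
    Returns None for a threshold that selects no model."""
    curve = []
    for threshold in thresholds:
        selected = [ok for specificity, ok in models if specificity > threshold]
        accuracy = sum(selected) / len(selected) if selected else None
        curve.append((threshold, accuracy))
    return curve


# Thresholds from -0.01 to 0.2 in steps of 0.01, built from integers
# to avoid accumulating floating-point error.
THRESHOLDS = [i / 100 for i in range(-1, 21)]
```

Calling `accuracy_by_threshold(models, THRESHOLDS)` yields one point per threshold, which corresponds to the kind of curve plotted in Figure 12.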