Mathematical Problems in Engineering

Volume 2017, Article ID 7318940, 26 pages

https://doi.org/10.1155/2017/7318940

## Efficient and Effective Learning of HMMs Based on Identification of Hidden States

^{1}Department of Electronics and Informatics (ETRO), Vrije Universiteit Brussel (VUB), Pleinlaan 2, 1050 Brussels, Belgium

^{2}Department of Industrial Sciences (INDI), Vrije Universiteit Brussel (VUB), Pleinlaan 2, 1050 Brussels, Belgium

^{3}Department of Data Science, iMinds, Technologiepark 19, 9052 Zwijnaarde, Belgium

Correspondence should be addressed to Tingting Liu; tliu@etrovub.be

Received 24 July 2016; Revised 21 December 2016; Accepted 29 December 2016; Published 23 February 2017

Academic Editor: Leonid Shaikhet

Copyright © 2017 Tingting Liu and Jan Lemeire. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

The predominant learning algorithms for Hidden Markov Models (HMMs) are local search heuristics, of which the Baum-Welch (BW) algorithm is the most commonly used. It is an iterative learning procedure that starts with a predefined size of the state space and randomly chosen initial parameters. However, poorly chosen initial parameters bring the risk of falling into a local optimum and of slow convergence. To overcome these drawbacks, we propose a more suitable model initialization approach, a Segmentation-Clustering and Transient analysis (SCT) framework, to estimate the number of states and the model parameters directly from the input data. Based on an analysis of the information flow through HMMs, we demystify the structure of models and show that high-impact states are directly identifiable from the properties of observation sequences. States having a high impact on the log-likelihood make HMMs highly specific. Experimental results show that even though the identification accuracy drops to 87.9% when random models are considered, the SCT method is around 50 to 260 times faster than the BW algorithm, with 100% correct identification for highly specific models whose specificity is greater than 0.06.

#### 1. Introduction

Hidden Markov Models (HMMs) [1] are among the statistical modelling tools that have shown great success and have been widely used in diverse application fields such as speech processing [2], machine maintenance [3], acoustics [4], biosciences [5], handwriting and text recognition [6], and image processing [7]. Despite their merits of simplicity and learning capability, HMMs still face open problems regarding learning effectiveness and efficiency.

There are two major problems in HMM learning: (1) choosing the model size (number of hidden states) and (2) estimating the model parameters. Regarding the first problem, state-of-the-art approaches normally train multiple HMMs with different numbers of states and select the best one using specific criteria (e.g., the Akaike information criterion (AIC) [8] or the Bayesian Information Criterion (BIC) [9]). To tackle the second problem, traditional learning algorithms such as the Baum-Welch (BW) algorithm are used to iteratively optimize the model parameters starting from an initial set of parameters, most often randomly chosen. Such iterative optimization heuristics are prone to local optima. Therefore, multiple runs (typically 10 [10, 11] or 20 [12, 13]) with different initializations are performed and the best of these is chosen. However, such iterative approaches with multiple trainings have the significant drawbacks of time inefficiency and high computational cost. Hsu et al. [14] introduced a noniterative, spectral algorithm for learning HMMs. It is simple, employing only a singular value decomposition and matrix multiplications. Nonetheless, it is evaluated in [15] and shown to be applicable only to identifying systems for which relatively few observations are available, while failing completely when the number of available observations is large. Fox et al. [8] proposed the sticky HDP-HMM, a nonparametric, infinite-state model that automatically and robustly learns the size of the state space and smoothly varying dynamics. However, this approach is computationally prohibitive for very large datasets [9]. Therefore, in spite of these limitations, classical iterative approaches are still widely used to estimate model size and model parameters, for lack of alternatives.

The aim of this paper is to improve the effectiveness and efficiency in model learning compared to the conventional BW algorithm, in the sense of accurately and quickly finding the correct model. One of the HMM assumptions is that the observed data is only dependent on the hidden states given the model. Therefore, the observed data often reflects the structure and statistical properties of the model, which motivates us to introduce a data-driven preestimation procedure to estimate the number of states and choose proper initial model parameters.

We first provide insight into the essential features of an HMM that help to improve the model’s expressiveness as a stochastic process [16]. We do this by inspecting the role of each hidden state in generating observation distributions as well as in providing information on the model structure. Hidden states with a large influence on observation sequences increase the value of a model more than those without or with low influence. By analysing how the information flows through HMMs, we determine which cases make a state have a high impact. As discussed in Section 3, persistent and/or transient-cyclic states appear to be high-impact states. Moreover, a model with high-impact states is highly *specific* and will be easy to identify. We introduce the term *specificity* as the minimum model distance between a model and the best of the HMMs with one state less. On the contrary, some HMMs are in principle unidentifiable, as proved in [17] by linking the learning of HMMs to the nonlearnability results of finite automata. Furthermore, there are models in between the learnable and the unlearnable HMMs, which are hard to learn from observation sequences. Such HMMs contain complex parameter configurations with low specificity and low-impact states. Overall, experimental results show that a better number of states and proper initialization learned by the proposed method increase the learning speed and accuracy for highly specific HMMs compared to the traditional Baum-Welch algorithm.

The remainder of the paper is organized as follows: in Section 2, the preliminaries about HMMs and the Baum-Welch learning problem are briefly reviewed, followed by the concepts and definitions of model characteristics such as model identifiability, model equivalence, and the minimality of models. In Section 3, the impact of states on model specificity is studied through an information flow analysis. The approximate identification framework is presented in Section 4, and experiments and results are discussed in Section 5. Finally, conclusions are given in Section 6.

#### 2. Preliminaries

An HMM [1] is a doubly stochastic process in which the underlying process is characterized by a Markov chain that is unobservable (hidden) but can be observed through another stochastic process emitting the sequence of observations. Let $N$ denote the number of states and $M$ the number of observation symbols. Let $S = \{s_1, s_2, \ldots, s_N\}$ and $O = \{o_1, o_2, \ldots, o_M\}$ denote the set of states and the set of observations, respectively. Using $q_t$ and $y_t$ to represent the state and the emitted observation at time $t$, respectively, the state and observation sequences are denoted by vectors $\mathbf{q} = (q_1, \ldots, q_T)$ and $\mathbf{y} = (y_1, \ldots, y_T)$, where $q_t \in S$, $y_t \in O$, and $T$ is the number of states or observations in the sequence. A discrete-time HMM can be characterized by the quintuple $\lambda = (N, M, \boldsymbol{\pi}, A, B)$ [1]: the initial state probability distribution is a column vector $\boldsymbol{\pi} \in \mathbb{R}_{\geq 0}^{N}$, where the $i$th element is
$$\pi_i = P(q_1 = s_i);$$
the state transition probability distribution matrix is $A \in \mathbb{R}_{\geq 0}^{N \times N}$, where the $(i,j)$th element is
$$a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i);$$
and the observation probability distribution matrix is $B \in \mathbb{R}_{\geq 0}^{N \times M}$, where the $(i,k)$th element is
$$b_i(k) = P(y_t = o_k \mid q_t = s_i).$$
Note that the state transition probabilities of state $s_i$ include both *incoming* and *outgoing* probabilities: the *incoming* state transition probabilities of $s_i$ are the $i$th column vector of $A$, denoted as
$$A_{:,i} = (a_{1i}, a_{2i}, \ldots, a_{Ni})^{\top},$$
and the *outgoing* state transition probabilities of $s_i$ are the $i$th row vector of $A$, denoted as
$$A_{i,:} = (a_{i1}, a_{i2}, \ldots, a_{iN}),$$
where $\mathbb{R}_{\geq 0}$ represents the set of nonnegative real numbers.

##### 2.1. The Baum-Welch Learning Algorithm

One of the three basic problems for HMMs is the learning problem [1], which is often solved by an Expectation-Maximization (EM) algorithm [18] named the Baum-Welch algorithm [19, 20]. Starting with a random initial guess of the model, the model parameters are iteratively reestimated as long as the new model has an increased likelihood compared to the previous one; that is, $P(\mathbf{y} \mid \lambda_{\text{new}}) \geq P(\mathbf{y} \mid \lambda_{\text{old}})$, where $P(\mathbf{y} \mid \lambda_{\text{old}})$ and $P(\mathbf{y} \mid \lambda_{\text{new}})$ represent the likelihood values of an observation sequence $\mathbf{y}$ under the previous model $\lambda_{\text{old}}$ and the newly updated model $\lambda_{\text{new}}$, respectively. This procedure continues until the likelihood converges to a stationary point. However, the BW algorithm suffers from getting stuck at a local optimum when the initial model parameters are not well chosen, which inspires this study to search for a better estimation of the initial parameters.
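As an illustration, the reestimation loop can be sketched for a discrete-observation HMM in plain NumPy. This is a minimal sketch, not the paper's implementation: the helper names (`forward`, `backward`, `baum_welch_step`, `fit`) and the toy sequence are our own, and no scaling is applied in the forward-backward pass, so it is only suitable for short sequences.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(y, pi, A, B):
    # alpha[t, i] = P(y_1..y_t, q_t = s_i | lambda)  (unscaled: short sequences only)
    T, N = len(y), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, y[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, y[t]]
    return alpha

def backward(y, A, B):
    # beta[t, i] = P(y_{t+1}..y_T | q_t = s_i, lambda)
    T, N = len(y), A.shape[0]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, y[t + 1]] * beta[t + 1])
    return beta

def baum_welch_step(y, pi, A, B):
    # one EM reestimation; also returns the log-likelihood of y
    # under the current (pre-update) parameters
    alpha, beta = forward(y, pi, A, B), backward(y, A, B)
    L = alpha[-1].sum()
    gamma = alpha * beta / L                              # P(q_t = s_i | y)
    xi = (alpha[:-1, :, None] * A[None]                   # P(q_t = s_i, q_{t+1} = s_j | y)
          * (B[:, y[1:]].T * beta[1:])[:, None, :]) / L
    pi_new = gamma[0]
    A_new = xi.sum(0) / gamma[:-1].sum(0)[:, None]
    B_new = np.array([gamma[np.array(y) == k].sum(0) for k in range(B.shape[1])]).T
    B_new /= gamma.sum(0)[:, None]
    return pi_new, A_new, B_new, np.log(L)

def fit(y, N, M, iters=200, tol=1e-6):
    # random initialization; iterate while the likelihood keeps increasing
    pi = rng.dirichlet(np.ones(N))
    A = rng.dirichlet(np.ones(N), size=N)
    B = rng.dirichlet(np.ones(M), size=N)
    ll_old = -np.inf
    for _ in range(iters):
        pi, A, B, ll = baum_welch_step(y, pi, A, B)
        if ll - ll_old < tol:      # converged to a stationary point
            break
        ll_old = ll
    return pi, A, B, ll

y = [0, 0, 1, 0, 1, 1, 1, 0] * 4
pi, A, B, ll = fit(y, N=2, M=2)
print(ll)
```

Because EM only guarantees a nondecreasing likelihood, different random initializations of `fit` can converge to different local optima, which is exactly the weakness the proposed preestimation targets.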

For the analysis, we need to calculate the likelihood of the observations given the model, that is, $P(\mathbf{y} \mid \lambda)$. It can be written by the use of projection operations; see, for instance, [16, p. 18]. Let $\alpha_1 = \boldsymbol{\pi}^{\top} D_{y_1}$ and $\alpha_t = \alpha_{t-1} A D_{y_t}$ for $t = 2, \ldots, T$, where
$$\alpha_t(i) = P(y_1, \ldots, y_t, q_t = s_i \mid \lambda);$$
thus
$$\alpha_T = \boldsymbol{\pi}^{\top} D_{y_1} A D_{y_2} \cdots A D_{y_T},$$
where $y_t \in O$ and $D_{y_t} = \operatorname{diag}(B_{:,y_t})$, which denotes the diagonal matrix of which the diagonal elements are the $y_t$th column of $B$.

Therefore, the *likelihood* of the observations given the model can be expressed as
$$P(\mathbf{y} \mid \lambda) = \boldsymbol{\pi}^{\top} D_{y_1} A D_{y_2} \cdots A D_{y_T} \mathbf{1}_N,$$
where $\mathbf{1}_N$ is a column vector of length $N$ with all entries equal to 1; that is, $\mathbf{1}_N = (1, 1, \ldots, 1)^{\top}$. For convenience of calculation, the logarithm of the likelihood, the *log-likelihood* (LL), is often used rather than the likelihood itself. Moreover, in this paper, we use the *unit log-likelihood*, an averaged LL, to represent the LL per single observation, that is, $\mathrm{LL}(\mathbf{y} \mid \lambda) = \log P(\mathbf{y} \mid \lambda)/T$, where $T$ is the number of observations. Within this paper, the term *log-likelihood* refers to the *unit log-likelihood* for simplicity.
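The matrix-product form of the likelihood and the unit log-likelihood can be evaluated directly. The sketch below uses a toy 2-state, 2-symbol model whose parameter values are our own illustration, not taken from the paper.

```python
import numpy as np

# Illustrative 2-state, 2-symbol HMM (values are our own, not from the paper).
pi = np.array([0.6, 0.4])                 # initial state distribution
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])                # transitions, rows sum to 1
B = np.array([[0.9, 0.1],
              [0.3, 0.7]])                # emissions, B[i, k] = P(o_k | s_i)

def likelihood(y, pi, A, B):
    # pi^T D_{y1} A D_{y2} ... A D_{yT} 1, with D_k = diag(B[:, k])
    v = pi * B[:, y[0]]                   # pi^T D_{y1}
    for k in y[1:]:
        v = (v @ A) * B[:, k]             # right-multiply by A, then by D_{yk}
    return v.sum()                        # inner product with the all-ones vector

y = [0, 1, 1, 0]
L = likelihood(y, pi, A, B)
unit_ll = np.log(L) / len(y)              # unit log-likelihood (LL per observation)
print(L, unit_ll)
```

The same value is obtained by summing $P(\mathbf{q}, \mathbf{y} \mid \lambda)$ over all state paths, but the matrix-product form does it in $O(TN^2)$ instead of $O(TN^T)$ operations.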

##### 2.2. Definitions of Model Characteristics

In this paper, we determine the learnability of HMMs through model identifiability. If two models are equivalent, the true model cannot be uniquely identified; hence we first introduce the definition of model equivalence. Note that HMM learning can be considered a probability-distribution-specific problem, in which every HMM has to be identified from the observations generated according to its own likelihood distribution. Therefore, the equivalence of HMMs can be defined based on their observation likelihood distributions as follows.

*Definition 1* (HMM equivalence). Two HMMs $\lambda_1$ and $\lambda_2$ are *equivalent* if and only if both models have the same observation emission probabilities (i.e., likelihood distribution over time series) for every observation sequence $\mathbf{y}$:
$$P(\mathbf{y} \mid \lambda_1) = P(\mathbf{y} \mid \lambda_2), \quad \forall \mathbf{y} \in O^{T},\ \forall T \geq 1;$$
alternatively, $\mathrm{LL}(\mathbf{y} \mid \lambda_1) = \mathrm{LL}(\mathbf{y} \mid \lambda_2)$.

Note that the observation probabilities remain the same when the states of $\lambda$ are permuted, since the states can be arbitrarily labeled. A model with permuted states is called a *trivially equivalent* model of the original model, as defined in [21]. We consider *trivially equivalent* models to be the same model. In order to compare models in later sections, we need to label states in a unique way such that corresponding states receive the same label. Therefore we define a process to normalize HMMs as follows.
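Trivial equivalence is easy to demonstrate numerically: relabeling the states permutes $\boldsymbol{\pi}$, the rows and columns of $A$, and the rows of $B$, yet leaves every sequence likelihood unchanged. The model values below are our own illustration.

```python
import numpy as np

# Illustrative model (values are our own, not from the paper).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
B = np.array([[0.9, 0.1],
              [0.3, 0.7]])

def likelihood(y, pi, A, B):
    v = pi * B[:, y[0]]
    for k in y[1:]:
        v = (v @ A) * B[:, k]
    return v.sum()

# Relabel the states with a permutation: a trivially equivalent model.
perm = [1, 0]
pi_p = pi[perm]
A_p = A[np.ix_(perm, perm)]   # permute rows and columns of A
B_p = B[perm]                 # permute rows of B

y = [0, 1, 1, 0, 0]
l1 = likelihood(y, pi, A, B)
l2 = likelihood(y, pi_p, A_p, B_p)
print(l1, l2)                 # identical likelihoods
```

This is why a normalization (Definition 2) is needed before two learned models can be compared parameter by parameter.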

*Definition 2* (HMM normalization). For each state $s_i$, a score is calculated from the state's parameters. Based on this score, we sort the states in ascending order.

Additionally, we can always construct an equivalent HMM with a larger number of states [22]; hence, in this paper, we consider HMM identifiability only for models that are *minimal*, as defined below.

*Definition 3* (HMM minimality). An HMM $\lambda$ is *minimal* if and only if it has a number of states equal to or fewer than that of any other equivalent model $\lambda'$; that is, $N_{\lambda} \leq N_{\lambda'}$. Model $\lambda'$ is called a *simpler* model of $\lambda$ if they are equivalent and $N_{\lambda'} < N_{\lambda}$.

*Definition 4* (HMM identifiability). An HMM $\lambda$ is *identifiable* if and only if it is minimal and there does not exist any *nontrivially equivalent* model $\lambda'$ with an equal number of states; that is, $N_{\lambda'} = N_{\lambda}$.

Moreover, in this study we only address the identification of *stationary* (or *homogeneous*) HMMs, for which the prior probabilities can be eliminated in the calculations. The initial state prior probability distribution influences learning only at the beginning of an observation sequence; its impact on long sequences vanishes over time, and it can thus be excluded when learning HMMs in practice. A stationary HMM is defined as follows.

*Definition 5* (stationary HMM). An HMM is *stationary* if its state distribution remains the same at every time instant; that is, $\boldsymbol{\pi}_t = \boldsymbol{\pi}$ for all $t$, where $\boldsymbol{\pi}$ equals the equilibrium state distribution; that is, $\boldsymbol{\pi}^{\top} = \boldsymbol{\pi}^{\top} A$ [23, p. 4902].

The element $\boldsymbol{\pi}$ is a column vector with $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_N)^{\top}$, $\pi_i \geq 0$, and $\sum_{i=1}^{N} \pi_i = 1$. The element $a_{ij} b_j(k)$ represents the probability of going from state $s_i$ to state $s_j$ while emitting the observation $o_k$ by state $s_j$; that is, $P(q_{t+1} = s_j, y_{t+1} = o_k \mid q_t = s_i) = a_{ij} b_j(k)$.
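The equilibrium condition of Definition 5 can be checked numerically: the stationary distribution is a left eigenvector of $A$ for eigenvalue 1, normalized to sum to 1. The transition matrix below is our own illustrative example.

```python
import numpy as np

A = np.array([[0.7, 0.3],
              [0.2, 0.8]])   # illustrative transition matrix (our own values)

# Equilibrium distribution: pi with pi^T = pi^T A, i.e. a left eigenvector
# of A for eigenvalue 1 (the largest eigenvalue of a stochastic matrix).
w, v = np.linalg.eig(A.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi /= pi.sum()               # normalize to a probability distribution
print(pi)                    # stationary state distribution
print(pi @ A)                # equals pi again: the distribution no longer changes
```

Initializing an HMM with this $\boldsymbol{\pi}$ makes it stationary from the first time step, which is why the prior can be dropped from the learning problem.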

Our proposed learning approach is based on the properties of observation sequences that make a state have a large impact on the model. To describe the degree of influence a state has on a model, we define a new term called *specificity* as the *distance* between model $\lambda$ and the *best* model with one state less. By *best*, we mean the model that best matches observations generated by the original model among all one-state-fewer models, which also means that it has the minimum model distance to the original model $\lambda$. A general definition of *model distance* is as follows.

*Definition 6* (HMM model distance). A model distance between two HMMs $\lambda_1$ and $\lambda_2$ is the difference of the unit log-likelihoods of an observation sequence [1, p. 271]:
$$D(\lambda_1, \lambda_2) = E\left[\frac{1}{T}\left(\log P(\mathbf{y} \mid \lambda_1) - \log P(\mathbf{y} \mid \lambda_2)\right)\right], \quad (11)$$
where $E$ refers to the expectation operator, $\mathbf{y}$ is an observation sequence generated by model $\lambda_1$, and $T$ is the size of the sequence. Equation (11) is basically a measure of how well model $\lambda_2$ matches observations generated by model $\lambda_1$, in comparison with how well model $\lambda_1$ matches observations generated by itself [1]. The *specificity* of a model can then be defined as follows.
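In practice the expectation in (11) is replaced by an average over one long generated sequence. The sketch below estimates the distance between two illustrative models of our own choosing (a strongly structured model and a uniform-noise model); the helper names `sample` and `unit_ll` are ours.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(pi, A, B, T):
    # draw an observation sequence of length T from the HMM (pi, A, B)
    q = rng.choice(len(pi), p=pi)
    y = []
    for _ in range(T):
        y.append(rng.choice(B.shape[1], p=B[q]))
        q = rng.choice(len(pi), p=A[q])
    return y

def unit_ll(y, pi, A, B):
    # scaled forward pass: log P(y | lambda) / T without underflow
    v = pi * B[:, y[0]]
    ll = 0.0
    for k in y[1:]:
        s = v.sum(); ll += np.log(s); v = v / s   # rescale at every step
        v = (v @ A) * B[:, k]
    return (ll + np.log(v.sum())) / len(y)

# Two illustrative models: lam1 has persistent, nearly deterministic states;
# lam2 is pure uniform noise (our own values).
lam1 = (np.array([0.5, 0.5]),
        np.array([[0.9, 0.1], [0.1, 0.9]]),
        np.array([[0.95, 0.05], [0.05, 0.95]]))
lam2 = (np.array([0.5, 0.5]),
        np.array([[0.5, 0.5], [0.5, 0.5]]),
        np.array([[0.5, 0.5], [0.5, 0.5]]))

# Model distance as in Eq. (11), estimated from one long sequence from lam1.
y = sample(*lam1, T=2000)
D = unit_ll(y, *lam1) - unit_ll(y, *lam2)
print(D)   # positive: lam1 explains its own data better than lam2
```

The scaled forward pass is essential here: the unscaled product of probabilities would underflow long before $T = 2000$.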

*Definition 7* (HMM specificity). The *specificity* of an HMM $\lambda$ with $N$ states is
$$\mathrm{Spec}(\lambda) = \min_{\lambda' \in \Lambda_{N-1}} D(\lambda, \lambda'), \quad (12)$$
where $\Lambda_{N-1}$ represents the set of all HMMs with $N-1$ states and $T$ is the length of an observation sequence generated by $\lambda$. We denote the optimal model with the minimum distance to model $\lambda$ in (12) as $\lambda^{*}_{N-1}$.

We have to note that, to use Definitions 6 and 7 in practice, we calculate the expectation with a single generated observation sequence. We assume that this sequence is long enough to be typical, giving a stable value close to the expected value and as such independent of the exact sequence, as is done by Rabiner [1].

To use the above definitions on a limited set of observation sequences, we have to rely on an approximate equivalence approach. In order to compare HMMs according to the likelihood probability of a given set of observation sequences, we have to define a threshold on the model distance to decide whether two HMM models are equivalent or not.

*Definition 8* (distance threshold of equivalent HMMs). The *distance threshold* is defined as
$$\epsilon_{\lambda} = 3\sigma,$$
where $\mathcal{N}(\mu, \sigma^2)$ is the asymptotic distribution of the log-likelihood $\mathrm{LL}(\mathbf{y}^{(l)} \mid \lambda)$, the elements $\mathbf{y}^{(l)}$, $l = 1, \ldots, L$, represent sequences randomly generated by model $\lambda$, $T$ is the length of an observation sequence, and $L$ is the total number of observation sequences [24]. Duan et al. [24] prove that the distribution of the log-likelihood can be approximated by a normal distribution $\mathcal{N}(\mu, \sigma^2)$. According to the "three-sigma" rule, the interval $[\mu - 3\sigma, \mu + 3\sigma]$ contains 99.7% of the whole distribution. Thus a sequence can with near certainty be regarded as generated by the model if its log-likelihood lies within $[\mu - 3\sigma, \mu + 3\sigma]$. As defined in Definition 1, two models are equivalent if and only if both models have the same likelihood distribution on observations. Hence, for any sequence generated by one model, if the other model yields a log-likelihood within the interval, that is, $\mathrm{LL}(\mathbf{y} \mid \lambda) \in [\mu - 3\sigma, \mu + 3\sigma]$, we can say the two models are approximately equal. Therefore, for practical use, the model distance threshold of equivalence is approximated as $3\sigma$ of the reference model.
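The threshold of Definition 8 can be estimated empirically: generate many sequences from the reference model, fit a normal distribution to their unit log-likelihoods, and take three sample standard deviations. The model values and helper names below are our own illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative reference model (our own values, not from the paper).
pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])

def sample(T):
    # draw one observation sequence of length T from the reference model
    q = rng.choice(2, p=pi)
    y = []
    for _ in range(T):
        y.append(rng.choice(2, p=B[q]))
        q = rng.choice(2, p=A[q])
    return y

def unit_ll(y):
    # scaled forward pass: log P(y | lambda) / T without underflow
    v = pi * B[:, y[0]]
    ll = 0.0
    for k in y[1:]:
        s = v.sum(); ll += np.log(s); v = v / s
        v = (v @ A) * B[:, k]
    return (ll + np.log(v.sum())) / len(y)

# Normal approximation N(mu, sigma^2) of the unit log-likelihood over
# L = 200 sequences of length T = 500, then the 3-sigma interval.
lls = np.array([unit_ll(sample(500)) for _ in range(200)])
mu, sigma = lls.mean(), lls.std()
lo, hi = mu - 3 * sigma, mu + 3 * sigma
print(mu, sigma, (lo, hi))
```

A candidate model whose unit log-likelihood on sequences from the reference model falls inside `[lo, hi]` would then be treated as approximately equivalent to the reference.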

As defined in Definition 3, a model is *minimal* if and only if it has a number of states equal to or fewer than that of any other equivalent model. In order to check model minimality in practice, we verify that there exists no one-state-simpler model which is equivalent to model $\lambda$; in particular, we verify that the minimum distance between $\lambda$ and $\Lambda_{N-1}$ (i.e., the specificity of $\lambda$; see Definition 7) is outside the threshold of equivalent models defined in Definition 8. Therefore, the practical condition to check *model minimality* is as follows: a model can be approximately taken as *minimal* if the absolute value of its *specificity* is outside the *distance threshold* of 3-sigma; that is, $|\mathrm{Spec}(\lambda)| > 3\sigma$.

#### 3. Impact of States on Observation Likelihood

We start the study with an information flow analysis to see the impact of different types of states on model specificity.

##### 3.1. Information Flow Analysis

Our aim is to understand which parameters give an HMM a higher specificity. However, an analytical expression for the specificity function requires knowing the optimal one-state-simpler model $\lambda^{*}_{N-1}$, which is still an open problem. This leads us to an alternative approach of analysing the *state properties* of models. In the following analysis, we study which properties make up a *high-impact* state and which do not. A *high-impact* state has a significant influence on $P(\mathbf{y} \mid \lambda)$ and makes the model more specific; it emits relatively unique patterns of observation sequences that can be distinguished from those of other states. Using this analysis, we will propose a framework in this paper to identify the *high-impact* states.

To study what influences the *specificity* of an HMM, we analyse the impact of a state on the likelihood $P(\mathbf{y} \mid \lambda)$ and how a state contributes to it. Consider the conditional probability $P(y_{t+1} \mid y_1, \ldots, y_t, \lambda)$ obtained by factorizing the likelihood in (10) over time. It can be seen as a probability used in predicting the future from the past, and it represents the information flow from the past to the future. Hence we analyse the contribution of a specific state to this probability. There are three cases whereby the probability of a state plays a role in the information flow, as shown in Figure 1:

(a) The present state probability $P(q_t)$ depends on the previous state probability $P(q_{t-1})$ and partly determines the observation probability $P(y_t)$.

(b) The present state probability depends on the observations and determines the succeeding state probability: the observation probability $P(y_{t+1})$ depends on $P(q_{t+1})$, which is updated with the knowledge of $y_t$.

(c) The present state probability is determined by the past state probability and affects the future state probability.