Abstract

Due to the complexity of the network structure, log analysis is usually necessary for the maintenance of network-based distributed systems, since logs record rich information about system behaviors. In recent years, numerous approaches have been proposed for log analysis; however, they ignore the temporal relationships between logs. In this paper, we target the problem of mining informative patterns from temporal log data. We propose an approach that discovers sequential patterns with temporal regularities from event sequences. The discovered patterns help engineers understand the behaviors of a network-based distributed system. To address the well-known problem of pattern explosion, we resort to the minimum description length (MDL) principle and take a step forward in summarizing the temporal relationships between adjacent events of a pattern. Experiments on real log datasets demonstrate the efficiency and effectiveness of our method.

1. Introduction

With the increasing demand for computing power, many network-based distributed systems have emerged, such as popular distributed storage systems like HDFS. A distributed system utilizes multiple machine nodes to complete tasks over the network. Since the network may be complex and each node may report anomalies, experts maintaining a network-based distributed application usually analyze the node logs to evaluate the health of the system rather than monitoring each node individually. Logs record the running status of the system and significant events such as the start and end of a task. Mining information from log sequences is thus a useful way to understand the behaviors of network-based distributed systems.

Many researchers have proposed different log analysis approaches in recent years. Some try to sketch the operation process of the system [1–3], and some are devoted to functional tasks such as anomaly detection and problem diagnosis [4, 5]. In this paper, we treat the timestamped log messages as an event sequence produced by a black-box system and target the problem of mining informative patterns from temporal log data. Here, by informative, we mean a set of patterns that summarizes the log sequence well. The discovered patterns can help operation engineers better understand system behaviors and serve as an excellent source of information for online monitoring and anomaly detection. Our work is motivated by the well-known distributed file system HDFS, in which many informative patterns corresponding to different system behaviors are generated during operation.

As the centerpiece of HDFS, one important function of the NameNode is to determine the mapping from the blocks of a file to DataNodes when a new file is written to HDFS. For the NameNode, the write operation regularly generates the pattern shown in Table 1. Normally, when a new data block is allocated, three copies of the block are created on different DataNodes, and after the required replicas are created, the NameNode updates the block map. This typical system behavior can be represented by the sequential pattern in Table 1.

By capturing this type of pattern, we can further implement several log analysis applications. For example, with the increasing amount of log data generated by complex systems, it is rather difficult for engineers to check log messages one by one, so compressing the log data to a smaller size without losing important information is in high demand. Since the pattern captures the semantics of the block allocation behavior, we can greatly reduce the number of logs by encoding the original log sequence with the discovered patterns. Besides, we can investigate the historical condition of the cluster, for example, by counting the number of newly added data blocks in the last week.

Sequential pattern mining has been an important data mining problem for decades, and dozens of research efforts focus on finding sequential patterns effectively [6–8]. However, traditional mining approaches often generate a huge number of patterns that overwhelm users. To address the well-known problem of pattern explosion [9], we resort to the minimum description length (MDL) principle [10], which has been used in several previous works [9, 11, 12]. The MDL principle provides a good balance between the complexity of the result pattern set and its ability to represent the data.

Although ours is not the first work to discover patterns using the MDL principle, we take a step forward in summarizing the temporal relationships among the events of a pattern. Logs monitor repetitive behaviors corresponding to the execution traces of program statements, so significant temporal regularities commonly exist in system logs. Consider the example in Figure 1, which shows simplified log messages for two behaviors of the NameNode; we use capital letters to represent event types and lowercase letters to represent the corresponding occurrences. There are two patterns: in the first, the second event is generated 2 or 5 seconds after the first, and the third event follows after 3 seconds; in the second, the interval between adjacent events is 1 or 2 seconds. Clearly, patterns that summarize the temporal relations between adjacent events convey more information about the running status of the system.

However, existing methods do not fully consider these temporal relationships. From the perspective of handling the temporal relations between adjacent events of a pattern, existing approaches can be divided into two groups. GoKrimp [9] and SQS [12] punish gaps by allocating a higher cost when encoding patterns with large gaps, which does not consider the regularity of interarrival times at all. CSC [11] considers patterns restricted to fixed intervals and does not allow any event type to appear more than once, which strongly limits the expressiveness of a pattern. Consequently, the patterns generated by previous methods either mix together and result in high redundancy [9, 12] or are too simple to represent the true behaviors [11].

In this paper, we propose an approach named DTS (discovering patterns from temporal log sequences) to remedy these defects. To the best of our knowledge, ours is the first work to encode the temporal regularity of patterns in an event sequence. The encoding scheme enables us to discover high-quality patterns with low redundancy.

The key contributions of our work include the following:
(1) We formalize how to use histograms to describe the distribution of time intervals between adjacent events in a pattern. Moreover, we design an encoding scheme that losslessly compresses the original sequence, as well as its temporal regularities, with the mined patterns.
(2) We introduce a heuristic algorithm, DTS, that discovers a set of informative patterns both effectively and efficiently.
(3) Evaluation results on real datasets show that our method is capable of discovering high-quality patterns with low redundancy.

The rest of this paper is organized as follows. In Section 2, we give the preliminary knowledge about our work, including the pattern semantic and the problem statement. The encoding scheme for temporal data is described in Section 3. In Section 4, we present the algorithm for mining patterns that best compress the log sequence. We evaluate the effectiveness and efficiency of our approaches in Section 5. We discuss related work in Section 6 and conclude this paper in Section 7.

2. Preliminaries

2.1. Log Sequence and Log Parsing

A log sequence is a sequence of log entry and timestamp pairs, denoted as S = ⟨(l_1, t_1), (l_2, t_2), …, (l_N, t_N)⟩, where l_i is the i-th log entry and t_i is the corresponding timestamp. The granularity of timestamps can be set to any level depending on the application.

Raw logs are usually unstructured, free-text messages. To analyze log data, a common preprocessing step is to parse unstructured log messages into structured representations [13, 14]. Concretely, log messages printed by the same statement are often highly similar to each other and vice versa [13]. Based on this observation, we can extract two parts of information from a log message: the event type and the parameters. The event type refers to the constant content shared by all logs generated by the same statement, while the parameters are the values of variables that differ from log to log. In this paper, we use Drain [14] to parse the original log messages due to its popularity and excellent performance. For each log message l_i, we store its event type and parameters in a structured representation, denoted as a tuple (e_i, x_i^1, …, x_i^{k_i}), where e_i refers to an instance of the event type of l_i, x_i^1, …, x_i^{k_i} refer to the values of the parameters, and k_i is the number of parameters in l_i. We assume that the event types of all logs come from a finite alphabet Ω. Therefore, the original log sequence is parsed into a temporal event sequence S = ⟨(e_1, t_1), …, (e_N, t_N)⟩. We use tspan(S) to denote the time spanned by S, i.e., tspan(S) = t_N − t_1. For brevity, we omit the parameters when unnecessary.
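The structured output of parsing can be illustrated with a minimal template matcher; the templates and message texts below are hypothetical placeholders and not Drain's actual API, which learns templates automatically:

```python
import re

# Hypothetical event templates: constant text with capture groups for parameters.
TEMPLATES = {
    "E1": r"Allocating block (\S+) for file (\S+)",
    "E2": r"Received block (\S+) from (\S+)",
}

def parse(message):
    """Return (event_type, parameters) or (None, []) if no template matches."""
    for etype, pattern in TEMPLATES.items():
        m = re.fullmatch(pattern, message)
        if m:
            return etype, list(m.groups())
    return None, []
```

Each parsed message thus yields the tuple of event type and parameter values described above.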

2.2. Pattern Discovery with the MDL Principle

In this work, our goal is to turn the raw log sequence into a sequence of more informative patterns. Specifically, we aim to discover a set of informative patterns which can (1) represent event relationships and temporal regularities and (2) provide the best lossless compression of the event sequence. This is achieved based on the minimum description length (MDL) principle [10].

2.2.1. A Brief Introduction to MDL

We use MDL as a metric to balance data compression quality and complexity. Specifically, we apply the two-part MDL principle, which can be roughly described as follows: given a set of candidate models, the best model M is the one that minimizes L(M) + L(D | M), where L(M) is the description length of the model M and L(D | M) is the description length of the data D when encoded with M. The description length is computed at the bit level. In this paper, the model M refers to a set of patterns. We will define the encoding scheme, i.e., how to describe S with a set of patterns and how to compute the description length, later. Thus, our problem of pattern discovery can be formulated as follows.

Problem Definition. Given a temporal event sequence S, find the set of patterns M that minimizes the description length L(M) + L(S | M).
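As a toy illustration of the two-part trade-off (the probabilities and model costs below are hypothetical and are not the paper's encoding), a more refined model may shorten the data part while its own description cost makes the total longer:

```python
import math

def data_dl(seq, model):
    """L(D | M): code each symbol with -log2 of its probability under the model."""
    return sum(-math.log2(model[s]) for s in seq)

def total_dl(seq, model, model_dl):
    """Two-part MDL score: L(M) + L(D | M)."""
    return model_dl + data_dl(seq, model)

seq = list("aaab")
uniform = {"a": 0.5, "b": 0.5}    # simple model, cheap to describe
skewed = {"a": 0.75, "b": 0.25}   # fitted model, assumed more expensive to describe
```

Here the skewed model compresses the data better but loses on the total score, which is exactly the balance MDL enforces.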

2.2.2. Semantics of the Pattern Language

To use MDL, we first need to specify the pattern language, which determines the “vocabulary” of all possible patterns we can discover given S. Specifically, the key elements of a pattern P are defined as follows:
(1) Content: the content of P is denoted as an episode ⟨E_1, E_2, …, E_n⟩, where E_j ∈ Ω, which can be used as a unique identifier of each pattern. The length of the episode of P is n.
(2) Occurrence (set): an occurrence of pattern P in S is denoted as o = ⟨e_1, e_2, …, e_n⟩, a list of events ordered by time, where e_j is an instance of event type E_j. The occurrence set of pattern P is denoted as occ(P). In this paper, we consider the leftmost occurrences of P.
(3) Support: the support of a pattern P is the number of its nonoverlapping occurrences, denoted as sup(P). Two occurrences of P are nonoverlapping if they do not share common events. We also do not allow overlapping among occurrences of different patterns.
(4) Time interval distribution: given the occurrence set of P, we use a histogram H_j to describe the distribution of the time intervals between the adjacent events E_j and E_{j+1} in P. Specifically, the bin H_j(b) counts the number of time intervals that equal b.
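The interval histograms of item (4) are straightforward to build from an occurrence set. The sketch below represents each occurrence by its list of event timestamps (an illustrative simplification), using the running example of gaps of 2 or 5 seconds followed by a gap of 3 seconds:

```python
from collections import Counter

def interval_histograms(occurrences):
    """For a pattern of length n, build n-1 histograms; histogram j counts
    the time gaps between the j-th and (j+1)-th events over all occurrences."""
    n = len(occurrences[0])
    hists = [Counter() for _ in range(n - 1)]
    for occ in occurrences:  # occ is a list of event timestamps
        for j in range(n - 1):
            hists[j][occ[j + 1] - occ[j]] += 1
    return hists
```

Regular behavior concentrates the mass in few bins, which is what the encoding scheme in Section 3 exploits.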

3. Data Encoding Scheme

In this section, we present our encoding scheme and the way we compute description lengths. It is worth noting that we are only interested in code length functions instead of actual encodings [10]. All logarithms in this paper are to base 2.

3.1. Encoding the Pattern Set

Under the MDL criterion, we need to first encode the model, which in our paper is the pattern set M. We use M to store the elements of all patterns, which comprise the items in the header row of Figure 2. The event sequence can then be compressed by replacing the occurrences of patterns with the corresponding codes. For each pattern, we need to encode its episode and the histograms describing the distributions of the time intervals between adjacent events. Although most of the events in S are covered by patterns, some events can be left out. Each such event is covered by a special pattern with only one item in its content (namely, the type of the event), which is called a singleton pattern and has no histogram.

For the description length computation, we first discuss how to calculate L(M). We use log2 |Ω| bits to describe the content of each event type; therefore, an episode of length n needs n · log2 |Ω| bits to describe. We use a unique code, denoted as code(P), to encode each pattern. Intuitively, frequent patterns should have shorter description lengths. Hence, the length of the binary code depends on the pattern’s frequency fr(P) = sup(P) / s, where s = Σ_{P′ ∈ M} sup(P′) is the sum of all patterns’ supports. Here, we consider optimal prefix codes, whose code length is L(code(P)) = −log2 fr(P).
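The prefix-code lengths follow directly from the supports; the pattern names below are placeholders:

```python
import math

def code_lengths(supports):
    """Shannon-optimal prefix code length per pattern: -log2(support / total)."""
    total = sum(supports.values())
    return {p: -math.log2(s / total) for p, s in supports.items()}
```

A pattern covering half of all occurrences thus gets a 1-bit code, while rarer patterns get proportionally longer codes.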

As for the histogram H_j, let the number of nonempty bins in H_j be m_j. Since the bin width is set to the time unit of S, the more regular the time intervals between adjacent events in occ(P) are, the less diverse their values are and the smaller m_j is. Consider the ratio m_j / sup(P): a smaller value indicates better temporal regularity between E_j and E_{j+1}. This observation allows us to design an encoding scheme that compresses the temporal regularity of pattern P’s occurrences.

To be specific, the value of a nonempty bin is denoted as H_j(b), where b is the difference between the timestamps of two adjacent events and H_j(b) is the number of intervals in occ(P) that equal b. We use log2 tspan(S) bits to encode b and log2 sup(P) bits to encode H_j(b). We also need extra bits to identify each nonempty bin, whose length depends on the size of H_j(b): the more intervals that equal b, the shorter the code, whose length is −log2(H_j(b) / sup(P)). Thus, the information of a histogram H_j can be described with L(H_j) = Σ_b (log2 tspan(S) + log2 sup(P) − log2(H_j(b) / sup(P))) bits, where the sum ranges over the nonempty bins. To sum up, the description length of a pattern P in M is

L(P) = n · log2 |Ω| + L(code(P)) + Σ_{j=1}^{n−1} L(H_j).
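A sketch of the histogram cost, assuming each nonempty bin pays for its interval value (log2 of the sequence's time span), its count (log2 of the number of intervals), and a prefix code identifying the bin:

```python
import math

def histogram_dl(hist, t_span):
    """Description length of one interval histogram (a {interval: count} dict),
    summed over its nonempty bins."""
    n = sum(hist.values())  # total number of intervals, i.e., the support
    bits = 0.0
    for count in hist.values():
        bits += math.log2(t_span)        # the interval value b
        bits += math.log2(n)             # the count H_j(b)
        bits += -math.log2(count / n)    # prefix code identifying the bin
    return bits
```

Fewer, fuller bins mean fewer summands and shorter bin identifiers, so temporal regularity directly lowers this cost.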

Figure 2 gives a concrete example of how the description length of a pattern is computed. For all patterns in M, the total description length is

L(M) = Σ_{P ∈ M} L(P).

3.2. Encoding the Event Sequence

Given the encoded pattern set M, we can encode the event sequence S by replacing the occurrences of patterns in S with their codes. Since S is a temporal sequence, for each occurrence of a pattern, we need to cover all of its events as well as their timestamps. In our encoding scheme, we replace an occurrence with the code of the corresponding pattern and the timestamp of its first event; the description length of a timestamp is log2 tspan(S). Moreover, we use the codes of the histogram bins to specify the time intervals between adjacent events: each interval of the occurrence is encoded individually with the code of its bin in H_j, whose length is −log2(H_j(b) / sup(P)). For the remaining events that are not covered by any pattern, we simply use singleton patterns to encode them.
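Putting these pieces together, the cost of one encoded occurrence can be sketched as follows, with the bin code lengths passed in precomputed:

```python
import math

def occurrence_dl(pattern_code_len, t_span, interval_bin_lens):
    """Cost of one encoded occurrence: the pattern's code, the timestamp of
    its first event (log2 of the time span), and one bin code per gap."""
    return pattern_code_len + math.log2(t_span) + sum(interval_bin_lens)
```

Summing this quantity over all occurrences (plus the singleton-encoded leftovers) gives the data part of the score.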

As a concrete example, we encode the sequence in Figure 1 using the pattern set in Figure 2. Here, S is encoded by replacing the original events and timestamps in each occurrence of each pattern with the corresponding codes: an occurrence is replaced by the code of its pattern, the timestamp of its first event, and the bin codes of its intervals.

With this encoding scheme, the description length of sequence S encoded by M is calculated as follows:

L(S | M) = Σ_{P ∈ M} Σ_{o ∈ occ(P)} (L(code(P)) + log2 tspan(S) + Σ_{j=1}^{n−1} −log2(H_j(b_j^o) / sup(P))),

where b_j^o is the interval between the j-th and (j+1)-th events of occurrence o.

4. DTS: Discovering Patterns from Temporal Sequences

In this section, we introduce our DTS algorithm for pattern discovery. The problem of discovering the set of patterns that best compresses S under MDL has been proven to be NP-hard [9]. Therefore, we resort to discovering informative patterns heuristically instead of pursuing the optimal result. We assume that sequence S is encoded entirely by singleton patterns at the very beginning. Our DTS algorithm starts from an empty pattern set M and updates it iteratively until no further compression can be achieved.

4.1. Overall Process of DTS

We now present an overview of our DTS algorithm. A naïve strategy is to update the pattern set by inserting the best pattern (the one achieving the largest gain in compression) found in the current sequence S. However, for system logs, it is common that the execution of a task generates several different branches. For example, Figure 3 shows a simplified process of an HDFS client writing files. Each time the client finishes writing a data block, it calls addBlock() to allocate a new block, and each time a DataNode receives a new block, it reports to the NameNode, generating one event sequence. In addition, the NameNode sometimes calls fsync() to persist the cached data, triggering an extra log event; therefore, a second, longer sequence can also be generated during the writing process.

The common part of these two sequences can be easily discovered and inserted into the pattern set M. Nonetheless, it is much harder to recognize the longer variant that includes the extra event. In practical maintenance applications, we hope to identify the behavioral patterns more completely, which cannot be achieved with the aforementioned naïve strategy.

Faced with this problem, we divide the update operations of M into two types, the insertion operation and the refinement operation, defined as follows.

Definition 1. (insertion operation). Insert a newly discovered candidate pattern P into the pattern set. The pattern set is updated to M ∪ {P}.

Definition 2. (refinement operation). Given a pattern P = ⟨E_1, …, E_n⟩ and a position p (0 ≤ p ≤ n), we refine P by adding events of type E at position p. The result is a new pattern P′ with length n + 1. Specifically,
(1) If p = 0, P′ = ⟨E, E_1, …, E_n⟩.
(2) If p = n, P′ = ⟨E_1, …, E_n, E⟩.
(3) If 0 < p < n, P′ = ⟨E_1, …, E_p, E, E_{p+1}, …, E_n⟩.
The refinement operation is denoted as F = (P, p, E), where E is called the refining event. The pattern set is updated to (M \ {P}) ∪ {P′, P_rem}, where P_rem denotes the remainder of the original pattern P. If there are remaining occurrences of the original P, the operation is called a partial refinement; otherwise, it is called a full refinement, and P_rem is empty. Figure 4 presents an example of the refinement operation.
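The episode part of the refinement amounts to a positional insert; a minimal sketch:

```python
def refine(episode, p, e):
    """Insert the refining event type e at position p of the episode
    (p = 0 prepends, p = len(episode) appends, anything between splices)."""
    return episode[:p] + [e] + episode[p:]
```

The harder part, handled later, is deciding which occurrences of the original pattern can actually absorb an instance of e.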

The pattern set is updated iteratively by either inserting a new pattern into M or refining a pattern that is already in M. It is common that an event covered by a candidate pattern P is also a refining event for another pattern P′ ∈ M; we refer to such a case as an event conflict. In each iteration, if there is no event conflict, we can greedily add the best candidate pattern to M. However, if an event conflict exists, since our encoding scheme does not allow overlap among different patterns, we must decide which update operation is better; here, we choose the operation that decreases the description length more. After choosing an update operation, we encode the events covered by the new patterns with the corresponding codes. The pattern set is updated iteratively until the description length cannot be decreased any more.

The overall process of DTS is divided into two stages: an update stage driven by event conflicts and an update stage containing only refinements. We maintain two data structures to support the two update operations: a candidate pattern set C for insertion and a candidate refinement map R for refinement. C dynamically stores the candidate patterns in the current sequence, while R stores the possible refinements for each pattern in M. As shown in Figure 5, in the first stage, DTS iteratively compares the compression gain of the candidate pattern with that of the refinement result and selects the better one to update M. When C becomes empty, namely, when we cannot obtain a valid candidate pattern any more, DTS continues with the second stage, iteratively updating M with the best refinement in R until R is empty. We now move on to the technical details of our DTS algorithm.

4.2. Candidate Pattern Set

Concerning the insertion operation, we hope that the pattern added to M best decreases the description length. To this end, we define the compression gain of a pattern P as follows:

Δ(P) = [L(M) + L(S | M)] − [L(M′) + L(S | M′)],

where M′ = M ∪ {P} denotes the pattern set after the insertion.

A naive way of selecting the best candidate pattern is to enumerate all possible patterns exhaustively. However, this is time-consuming due to the exponential search space. In practice, instead of searching from scratch in each iteration, we maintain a candidate pattern set C that stores promising candidate patterns. We initialize C with Algorithm 1. For each event type e ∈ Ω, BESTGROWTH generates a pattern starting from the singleton pattern ⟨e⟩ by greedily appending an event type to the end of the episode. Since it is computationally intensive to find the optimal occurrence set with the fewest histogram bins, we resort to a simple approach that finds the leftmost occurrences of the pattern. If the newly grown pattern yields a greater compression gain than the previous one, we move on to test the next pattern created by a new growth; the process stops when we cannot obtain a new pattern better than the previous one. We then select the last pattern as the best pattern starting with e and add it to C.

Input: singleton event set Ω and sequence S
Output: candidate pattern set C
(1) function INITIALIZE(Ω, S)
(2)   C ← ∅
(3)   for each e ∈ Ω do
(4)     create a singleton pattern P ← ⟨e⟩
(5)     P′ ← BESTGROWTH(P, S)
(6)     while Δ(P′) > Δ(P) do
(7)       P ← P′; P′ ← BESTGROWTH(P, S)
(8)     end while
(9)     if Δ(P) > 0 then
(10)      add P to C
(11)    end if
(12)  end for
(13)  sort the patterns in C by Δ
(14)  return C
(15) end function
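The initialization loop of Algorithm 1 can be sketched as follows; `gain` and `grow` are placeholders standing in for the compression gain Δ(·) and the BESTGROWTH step, both of which depend on the full encoding scheme:

```python
def initialize_candidates(event_types, gain, grow):
    """For each event type, grow a pattern greedily while the compression
    gain keeps improving; keep the pattern if its final gain is positive."""
    candidates = []
    for e in event_types:
        pattern = [e]
        nxt = grow(pattern)
        while nxt is not None and gain(nxt) > gain(pattern):
            pattern = nxt
            nxt = grow(pattern)
        if gain(pattern) > 0:
            candidates.append(pattern)
    candidates.sort(key=gain, reverse=True)  # best candidate first
    return candidates
```

Because growth stops at the first non-improving step, this is a local search; the later refinement operations compensate for branches it misses.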
4.3. Candidate Refinement Map

Considering the refinement operation, we maintain a candidate refinement map R to store the best refinements for each pattern P ∈ M. Each refinement follows the format F = (P, p, E). For example, (P, 1, E) may indicate that pattern P is partially refined by event type E at position 1, generating a new pattern while some occurrences of the original pattern remain. On the contrary, if P is fully refined by E, the refinement replaces P entirely. The refinements in R are indexed by the refining event type; in other words, the refinement results with the same refining event type are grouped together.

When the pattern set is updated with a new pattern P, either selected from C or generated by a refinement operation, we decide whether there exists a good refinement at each position of P using Algorithm 2. For each position p, we only store the refinement that best decreases the encoded description length. The compression gain of a refinement F is defined analogously to that of an insertion:

Δ(F) = [L(M) + L(S | M)] − [L(M_F) + L(S | M_F)],

where M_F denotes the pattern set after applying F.

Input: pattern P, sequence S, singleton event set Ω, and candidate refinement map R
(1) function OBTAINREFINEMENTS(P, S, Ω, R)
(2)   for p = 0 to |P| do
(3)     F_best ← null
(4)     for each e ∈ Ω do
(5)       F ← REFINE(P, p, e)
(6)       if Δ(F) > Δ(F_best) then
(7)         F_best ← F
(8)       end if
(9)     end for
(10)    if Δ(F_best) > 0 then
(11)      add F_best to R
(12)    end if
(13)  end for
(14) end function

A refinement REFINE(P, p, e) is obtained as follows. The content of the newly refined pattern follows directly from Definition 2. The occurrences of the refined pattern are derived from the original occurrence set of P: for each occurrence o ∈ occ(P), we find the leftmost event instance of type e in S that meets the time restriction at position p and refine o with it.
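One way to extend a single occurrence, under the assumption that the "time restriction" means the refining instance must fall between its would-be neighbors in time, is:

```python
def refine_occurrence(occ, p, instances):
    """Extend one occurrence (a list of (event, time) pairs) with the leftmost
    instance of the refining event type that fits at position p."""
    lo = occ[p - 1][1] if p > 0 else float("-inf")
    hi = occ[p][1] if p < len(occ) else float("inf")
    for inst in instances:  # instances of the refining type, sorted by time
        if lo < inst[1] < hi:
            return occ[:p] + [inst] + occ[p:]
    return None  # no instance satisfies the time restriction
```

Occurrences for which no fitting instance exists stay with the remainder pattern, which is exactly the partial-refinement case.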

4.4. The Complete DTS Algorithm

In this section, we present the complete DTS algorithm for the heuristic discovery of the pattern set M. As shown in Algorithm 3, we update M iteratively until the description length of S cannot be decreased any more. At the beginning, we initialize the relevant data structures. Lines 4–17 correspond to the update stage driven by event conflicts. During this stage, DTS first gets the best candidate pattern P in C. Then, DTS calls the function BESTCONFLICTREFINEMENT(R, P) to check whether there exist one or more refinements in R that conflict with P and to retrieve the best one, F. DTS decides which operation better updates the pattern set by comparing the compression gains of P and F with the following equation:

δ(P, F) = Δ(P) / cov(P) − Δ(F) / cov(F),

where cov(·) denotes the number of events covered by the corresponding operation.

Input: sequence S
Output: pattern set M
(1) function DTS(S)
(2)   Ω ← all event types in S, M ← ∅
(3)   C ← INITIALIZE(Ω, S), R ← ∅
(4)   while true do
(5)     P ← BESTCANDIDATEPATTERN(C)
(6)     if P = null then
(7)       break
(8)     end if
(9)     F ← BESTCONFLICTREFINEMENT(R, P)
(10)    if F = null or δ(P, F) > 0 then
(11)      M ← M ∪ {P}
(12)      UPDATE C, UPDATE R, ENCODE(S, P)
(13)    else
(14)      apply F to update M
(15)      UPDATE C, UPDATE R, ENCODE(S, F)
(16)    end if
(17)  end while
(18)  while (F ← BESTCANDIDATEREFINEMENT(R)) ≠ null do
(19)    apply F to update M
(20)    UPDATE C, UPDATE R, ENCODE(S, F)
(21)  end while
(22)  return M
(23) end function

Note that we compare the compression gains of the two operations at the event level in order to avoid bias towards patterns with longer episodes or greater frequency. A positive value of δ(P, F) indicates that P has the greater compression gain; otherwise, F is better. If P is better or there is no conflicting refinement, M is updated to M ∪ {P}; otherwise, it is updated by applying F.

When the function BESTCANDIDATEPATTERN(C) returns null, indicating that there is no candidate pattern in the current sequence any more, DTS moves on to the next stage. Lines 18–21 correspond to the update stage containing only refinements, in which DTS iteratively updates M with the best refinement in R.

To illustrate, consider the example in Figure 6, in which there is a best current candidate pattern P and two conflicting refinements in R, the better of which, F, is retrieved by BESTCONFLICTREFINEMENT(R, P). If F has the greater compression gain, the pattern set M is updated by applying F; otherwise, pattern P is added, resulting in M ∪ {P}.

The data structures C and R are maintained dynamically. After each update, we delete the invalidated elements in them and adjust the patterns that are influenced by the update.

5. Experiments

In this section, we conduct extensive experiments to verify the effectiveness of our approach. All experiments were executed on a machine with an Intel i7-3770 CPU @ 3.4 GHz, 24 GB of RAM, and Windows 10.

5.1. Datasets and Settings
5.1.1. Datasets

We use four different log datasets collected from real systems. The basic statistics of these datasets are summarized in Table 2. The Zookeeper and OpenStack datasets are collected from the well-known loghub data repository [15], while the NameNode and DataNode logs are collected from our own HDFS cluster.

These datasets have different characteristics. The NameNode logs have more event types and more complex system behaviors interleaved together, as a NameNode manages multiple DataNodes. The DataNode and Zookeeper logs have fewer event types as well as simpler and more regular behaviors. The OpenStack logs contain only behaviors such as creating a project and other simple tasks and thus have the fewest event types.

For each log dataset, we calculate the ratio of event types that occur fewer than 50 times. As shown in Table 2, three of the four datasets contain a high percentage (more than 65%) of low-support event types. To discover nontrivial patterns in these datasets effectively, we need to set the support threshold to a low value, which can lead to the pattern explosion problem. We will later show that DTS discovers nontrivial patterns while maintaining a low level of redundancy.

5.1.2. Rival Approaches

We compare our approach against four rivals: GoKrimp [9], SQS [12], CSC [11], and ISM [16]. The first three use the MDL principle [10] to discover patterns, while ISM is a probabilistic machine learning approach. We did not include SeqKrimp [9] in our experiments because GoKrimp yields similar results and is more efficient. For CSC and our approach, we simply use the entire sequence as the input; for the other approaches, the sequences are broken into disjoint sequences of size 10. Our algorithms are implemented in Java, while the implementations of the other approaches were obtained from their original authors.

5.2. Evaluation of Efficiency

In the first experiment, we compare the efficiency of all approaches as the length of the log sequence varies on the NameNode and DataNode datasets. The results are shown in Figure 7. It is worth noting that although the results of all five approaches are plotted in the same figures, the running times of CSC, GoKrimp, and our approach are reported in seconds, while those of SQS and ISM are in minutes.

As shown, CSC is faster than all other approaches because its gap constraint greatly restricts the size of the search space. However, this constraint also means that CSC cannot find complex patterns effectively, as the next experiment will show. DTS is slower than GoKrimp because GoKrimp applies a dependency test for speedup; however, this test prevents GoKrimp from discovering low-frequency patterns.

Although DTS takes comparatively longer, it needs less than 400 seconds to process a complex sequence with 600,000 events, which is acceptable in real applications. Besides, the running time of DTS increases steadily with the log size, indicating good scalability. In contrast, SQS and ISM are much slower, with processing times ranging from dozens of minutes to several hours due to their complex iterative computations.

The running times on the Zookeeper and OpenStack datasets are shown in Table 3, which shows similar results to those on the NameNode and DataNode logs. CSC is still the most efficient, while DTS comes next.

5.3. Evaluation of Pattern Quality

In this section, we analyze the quality of the patterns returned by the different approaches from three aspects: the compression ratio (CR), the event coverage (EC), and the redundancy among the discovered patterns.

5.3.1. Evaluation of the Compression Ratio

According to the MDL principle, we can encode the sequence with the discovered pattern set: the smaller the description length, the stronger the expressiveness of the encoding pattern set. Here, we use the compression ratio [9] (denoted as CR) to measure the effectiveness of the encoding. CR is defined as the ratio of the description length of the encoding with only singleton patterns to that of the encoding with the discovered patterns; the higher the ratio, the better the expressiveness. Note that here we only compare the compression ratios of the MDL-based approaches.

Figure 8 shows the compression ratios on all four datasets. As indicated, DTS generally achieves the best compression ratio, thanks to our effective encoding scheme as well as the proposed refinement operations and the heuristic strategy based on event conflicts. Among the other three MDL-based algorithms, the iterative search strategy of GoKrimp is too simple to yield expressive results, so it generally achieves the lowest compression ratio. CSC performs well on the DataNode and Zookeeper logs, yielding performance similar to DTS; this can be attributed to the fact that CSC is suitable for log data in which the time intervals between adjacent events are short and constant. On the contrary, for log sequences such as the NameNode logs, which contain complex patterns, CSC tends to perform poorly. As for SQS, thanks to its complex search strategy, it performs well on more complex data. The problem with SQS is that it generates too many patterns, so the model used to describe the data becomes too complicated, resulting in a greater description length. This negatively impacts the compression ratio of SQS, especially when the patterns in the log data are simple.

It is worth noting that while our approach generally achieves higher compression ratios, this does not mean that it produces more redundancy, as we discuss later.

5.3.2. Evaluation of the Event Coverage

In this section, we evaluate whether the patterns returned by the different approaches adequately cover the various event types. We use the event coverage [17] (denoted as EC) as the metric, which measures the percentage of event types covered by the discovered patterns. A larger EC indicates better performance.

Figure 9 shows the event coverage of all five approaches on the four log datasets. As shown, except for GoKrimp, all approaches achieve an event coverage of over 80%. By contrast, while GoKrimp achieves an EC of 91% on the OpenStack dataset, it fails to reach 50% on the remaining datasets. This is related to the proportion of rare event types in the log data. As mentioned before, the search heuristic of GoKrimp is oversimplified, which prevents it from discovering patterns with infrequent event types. Therefore, the more infrequent event types the log data contain, the lower the event coverage of GoKrimp. By contrast, this factor has little effect on the other four algorithms.

Overall, DTS achieves the best coverage on three of the four log datasets; it is only slightly inferior to SQS and CSC on the OpenStack logs. This is because three event types in the OpenStack dataset occur randomly in the sequence, exhibiting no significant temporal regularity. Therefore, DTS does not consider these event types part of any pattern, while CSC and SQS greedily include them in some patterns. This showcases the advantage of DTS: taking the temporal regularities of patterns into consideration reduces the number of invalid patterns made up of random combinations of unrelated events, which leads to a far less redundant result than CSC and SQS.

5.3.3. Evaluation of the Pattern Redundancy

The goal of this work is to mine high-quality, low-redundancy pattern sets from log sequences. In the previous sections, we evaluated pattern quality using the compression ratio and the event type coverage. We now shift our attention to the evaluation of pattern redundancy, using the following four metrics:

(1) Average intersequence edit distance: this metric measures the edit similarity among the discovered patterns. Concretely, for each pattern in the set, we compute the edit distance between it and each of the remaining discovered patterns, normalize each distance by the pattern's length, and record the minimum. The metric is obtained by averaging these minimum distances over all patterns in the set. The larger the distance, the less redundant the pattern set.

(2) Average count of supersets (ACS): this metric measures the diversity of the event types contained in the discovered pattern set. Concretely, for each pattern, we take its set of event types and count the remaining patterns whose event type sets are supersets of it. ACS is obtained by averaging this count over all patterns. The fewer patterns whose event type sets are supersets of a given pattern's, the less redundant that pattern is. Therefore, the smaller the ACS, the more diverse the event types and the less redundant the pattern set.

(3) Number of patterns: the number of discovered patterns.

(4) Average length of patterns: the average length of the discovered patterns.
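The two redundancy metrics above can be sketched as follows, assuming patterns are tuples of event-type labels (the function names are illustrative, not from the paper's implementation):

```python
def edit_distance(a, b):
    """Standard Levenshtein distance via a rolling dynamic-programming row."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution
            prev = cur
    return dp[n]

def avg_min_edit_distance(patterns):
    """For each pattern, the minimum length-normalized edit distance to any
    other pattern; averaged over the set (larger -> less redundant)."""
    mins = []
    for i, p in enumerate(patterns):
        others = [edit_distance(p, q) / len(p)
                  for j, q in enumerate(patterns) if j != i]
        mins.append(min(others))
    return sum(mins) / len(mins)

def avg_count_of_supersets(patterns):
    """For each pattern, how many other patterns' event-type sets are
    supersets of its own; averaged over the set (smaller -> less redundant)."""
    type_sets = [set(p) for p in patterns]
    counts = [sum(1 for j, t in enumerate(type_sets) if j != i and t >= s)
              for i, s in enumerate(type_sets)]
    return sum(counts) / len(counts)

patterns = [("A", "B", "C"), ("A", "B", "C", "D"), ("X", "Y")]
print(avg_min_edit_distance(patterns), avg_count_of_supersets(patterns))
```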

Table 4 reports the results of the four evaluation metrics on the different datasets. As shown, our approach yields a larger average intersequence edit distance and a smaller average count of supersets, which indicates that the patterns generated by our approach have lower redundancy and higher diversity. In terms of the number of patterns, DTS produces relatively small values while ensuring adequate coverage of event types and a high level of compression. Overall, the pattern sets discovered by DTS express the meaningful information of a log sequence with low redundancy.

Among the four rival algorithms, GoKrimp yields relatively good performance, generating the smallest number of patterns with relatively low redundancy. However, as previously shown, the expressiveness of GoKrimp is generally the worst among all methods. In fact, the search strategy of GoKrimp is a double-edged sword: on the one hand, it helps achieve compact results; on the other hand, it limits GoKrimp’s ability to mine meaningful patterns. Overall, this tradeoff is not desirable in practice. As for ISM, it has reasonable values on both redundancy metrics, yet it is inferior to DTS. Meanwhile, the results of CSC on the Zookeeper and DataNode logs are less redundant, which again manifests the effectiveness of CSC for this type of log data. However, on the NameNode and OpenStack logs, the redundancy of CSC is high; in particular, its ACS values exceed 10, meaning that each combination of event types generates more than 10 patterns on average. As for SQS, it exhibits a high degree of redundancy on every type of log data. This is because the encoding scheme of SQS only considers the support of the patterns and does not fully account for temporal regularity. Therefore, any sequential pattern generated by frequent behaviors can yield good compression, which leads to many redundant patterns.

6. Related Work

As an important data source for system management, logs have been widely used in many tasks such as anomaly detection [5, 13], program workflow modeling [1], failure diagnosis [18], and performance monitoring [19]. These works mostly focus on automatic system maintenance and diagnosis through log analysis. In this paper, we aim at discovering meaningful patterns from raw log sequences; the discovered patterns can provide useful information that supports the aforementioned tasks.

Sequential pattern mining was first introduced by Agrawal and Srikant [6]. Since then, various sequential pattern mining algorithms have been proposed, such as PrefixSpan [20], SPAM [21], and BIDE [7]. Traditional pattern mining approaches usually generate a huge number of redundant patterns, a problem commonly known as pattern explosion. Several restrictions on pattern semantics have been proposed to tackle it, such as closed frequent patterns [22] and maximal frequent patterns [23]; however, these measures do not fully resolve the problem. To reduce pattern redundancy, modern sequential pattern mining approaches resort to the minimum description length (MDL) principle [10] to select a set of patterns that can best compress the data. KRIMP [24] pioneered the use of MDL in identifying good pattern sets, mining itemsets that describe a transaction database well. GoKrimp [9], SQS [12], and CSC [11] extend this methodology to the sequential pattern mining task. The basic idea of these approaches is to cover a sequence database with a set of patterns that achieves the highest compression. Li [25] proposed FCT, a parallel approach that efficiently finds frequent co-occurring terms in relational data.
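To make the MDL idea behind these approaches concrete, the following toy sketch selects, by exhaustive search, the candidate pattern subset that minimizes a simplified two-part description length: one unit per emitted code plus one unit per event stored in the model. Real encoders such as those of KRIMP or SQS are far richer; all names here are illustrative, not actual APIs of those systems.

```python
import math
from itertools import chain, combinations

def description_length(sequence, patterns):
    """Toy L(M) + L(D|M): greedily replace contiguous pattern occurrences,
    count one unit per emitted code plus one per event in the model."""
    i, codes = 0, 0
    while i < len(sequence):
        for p in sorted(patterns, key=len, reverse=True):
            if tuple(sequence[i:i + len(p)]) == p:
                i += len(p)
                break
        else:
            i += 1  # no pattern matches: emit a singleton code
        codes += 1
    return codes + sum(len(p) for p in patterns)

def best_pattern_set(sequence, candidates):
    """Exhaustively pick the candidate subset minimizing the toy MDL cost."""
    subsets = chain.from_iterable(combinations(candidates, r)
                                  for r in range(len(candidates) + 1))
    return min(subsets, key=lambda ps: description_length(sequence, ps))

seq = ["A", "B", "C", "A", "B", "C", "A", "B", "C", "D"]
cands = [("A", "B", "C"), ("C", "A"), ("D",)]
print(best_pattern_set(seq, cands))  # only ("A", "B", "C") earns its model cost
```

Spurious candidates such as ("C", "A") are rejected because the codes they save do not pay for storing them in the model, which is exactly how MDL-based mining curbs pattern explosion.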

However, these existing approaches do not take the temporal regularity of timestamped sequences into consideration. By comparison, our approach achieves better results by utilizing the event relationships within a pattern as well as the temporal regularities among events, and it is capable of discovering informative patterns with low redundancy. Instead of designing an MDL-based encoding scheme, ISM [16] presents a subsequence interleaving method based on a probabilistic model of the sequence database, which searches for the most compressive set of patterns. However, this approach, along with the aforementioned SQS, is slow in our experiments due to its exhaustive search.

7. Conclusions

In this paper, we have proposed a novel approach to discover sequential patterns from log sequences with temporal regularities. The approach processes the log data in a black-box manner and does not require any domain knowledge of the system. Specifically, we have drawn on the MDL principle and formalized an encoding scheme that takes event relationships as well as temporal regularities into consideration. Based on this scheme, we have proposed DTS, an efficient heuristic algorithm that greedily updates the pattern set. Extensive experiments on real datasets show that the proposed approach discovers high-quality patterns efficiently.

Data Availability

The NameNode and DataNode datasets are available from the corresponding author upon request. The Zookeeper and OpenStack datasets are collected from the well-known loghub data repository (https://github.com/logpai/loghub).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by MIIT Project (Data Management Standards and Verification for Industrial Internet Identifier Resolution).