Summarizing an Event Sequence Database into a Compact Big Sequence

(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 13, No. 8, 2022

—Detecting the core structure of a database is one of the main objectives of data mining. Many pattern set mining methods do so by mining a small set of patterns that together summarize the dataset efficiently: the better these patterns, the more effective the summarization. Most of these methods are based on the Minimum Description Length principle. Here, we focus on event sequence databases. Rather than mining a small set of significant patterns, we propose a novel method that summarizes the event sequence dataset by constructing a compact big sequence, namely BigSeq. BigSeq preserves all characteristics of the original event sequences. It is constructed efficiently via the longest common subsequence and a novel definition of the compatible event set. Experimental results show that the BigSeq method outperforms state-of-the-art methods such as GoKrimp with respect to compression ratio, total response time, and number of detected patterns.


I. INTRODUCTION
Detecting the key patterns of a database is one of the main objectives of data mining. Many studies mine all patterns that satisfy some constraints: frequent patterns as in PrefixSpan [11], CM-SPADE [19], PRISM [15], and [20]; closed patterns as in [24][7][14][2]; or maximal patterns as in [3][23]. Rather than mining all patterns, other existing methods mine a set of patterns that is significant for summarizing the dataset. There are several ways to define these significant patterns. One of them is the Minimum Description Length (MDL) principle [21][12][6][22], which has proven to be particularly successful. It is based on the insight that any regularity in the dataset can be used to compress the dataset: the more we can compress, the more regularity we have found. More details about MDL are given in Section III.
For itemset data, Krimp [13] is based on the MDL principle. For sequence data, the authors of SeqKrimp [8][9], GoKrimp [8][9], and SQS [18] used the MDL principle to compress the data. More details about these algorithms are given in the related work section (Section III).
In this paper, we focus on event sequence data. Our objective is to find a summary of the given event sequences whose size is very small compared to the size of the event sequence dataset, while preserving all characteristics of the original event sequences. Existing methods mine significant patterns that compress the dataset well. Some of these methods first generate the sequential patterns, and then, in a second phase, select the significant patterns with respect to MDL. Note that the significant patterns form only a small subset of all sequential patterns, and mining all sequential patterns is a very expensive process. Therefore, other existing methods devise pruning techniques to discard the ineffective parts of the search space that do not contain any significant pattern. Unfortunately, this pruning itself consumes considerable time unless efficient techniques are used.
Contribution. All of the above methods apply a mining process to search for significant patterns. In contrast, our proposed method does not apply a mining process at all. Instead, all event sequences in the dataset are merged into a single compact big sequence. In other words, our method detects only one significant pattern: the compact big sequence itself. Since the detected big sequence must be as compact as possible, we introduce an efficient construction method that reduces its size as much as possible. The construction is based on the longest common subsequence and a novel definition of the compatible event set. Our compact big sequence preserves all characteristics of the original event sequences: it keeps the order of events as in the dataset, and it associates with each event in the big sequence the list of ids of the sequences that contain this event. To confine the growing size of these id lists, we represent them as sets of bit-vectors, in which runs of consecutive zeros are compressed efficiently.
Organization. This paper is organized as follows. Section II defines the preliminary concepts. Section III presents the related work. Section IV presents our proposed algorithm. Section V describes how the event sequence database is compressed with the BigSeq method. Section VI reports the experimental results. Finally, Section VII concludes the paper.

II. PRELIMINARY CONCEPTS
Let E = {e_1, e_2, ..., e_m} be a set of m distinct events. An event sequence S = <u_1, u_2, ..., u_l> over E is an ordered list such that u_i ∈ E. An event sequence W = <w_1, w_2, ..., w_h> is a subsequence of the event sequence S if there exist h integers (j_1, j_2, ..., j_h) such that 1 ≤ j_1 < j_2 < ... < j_h ≤ l and w_1 = u_{j_1}, w_2 = u_{j_2}, ..., w_h = u_{j_h}. An event sequence of length l is called an l-sequence. An event sequence database D = {S_1, S_2, ..., S_n} is a set of event sequences with |D| = n. For example, Table I shows an event sequence database D with |D| = 8:

TABLE I. AN EXAMPLE EVENT SEQUENCE DATABASE D
S1: ABCBC
S2: ABAA
S3: CABAC
S4: CAC
S5: ABCB
S6: CBAC
S7: BCAB
S8: ACBBA

The sequence S_5 = ABCB is a subsequence of the sequence S_1 = ABCBC (S_5 ⊑ S_1); we also say that S_1 is a supersequence of S_5.

Definition 2.1: Longest Common Subsequence. Given two event sequences X and Y, the longest common subsequence of X and Y, denoted lcs(X, Y), is a longest sequence Z that is a subsequence of both X and Y.
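Definition 2.1 can be computed with the standard dynamic program in O(|X| · |Y|) time. The following Python sketch is illustrative only (the paper's implementation is in C++); it fills the usual length table and backtracks to recover one longest common subsequence:

```python
def lcs(x, y):
    """Classic dynamic-programming longest common subsequence.

    Returns one LCS of the event sequences x and y (given as strings
    of single-character event labels).
    """
    m, n = len(x), len(y)
    # dp[i][j] = length of the LCS of the prefixes x[:i] and y[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Backtrack through the table to recover one LCS.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

For the database of Table I, lcs("CBAC", "ABCBC") returns CBC, the value used in the construction walkthrough of Section IV.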
Problem Definition: Given an event sequence database D, the objective is to find a summary S of D such that S preserves all characteristics of D and the size of S is sharply less than the size of D (|S| << |D|).

III. RELATED WORK
We begin by discussing the minimum description length principle in detail.

The Minimum Description Length
The minimum description length (MDL) principle [21][12][6][22] is widely used in data compression. It serves as a criterion for selecting a set of compressive patterns: if these patterns are used as a dictionary, we have the potential to maximally compress the dataset into a compact pattern encoding. In other words, these patterns represent the dataset efficiently. Unfortunately, selecting such patterns based on MDL is an NP-hard problem. Given a set of models M, the MDL principle states that the best model M ∈ M for the dataset D is the one that provides the best lossless compression. Formally, the objective is to minimize the length in bits of the model plus the length in bits of the dataset when compressed with model M. MDL has been applied to detect compressive frequent patterns in itemset and sequence data. In the next paragraphs, we discuss the algorithms that are based on MDL.
For itemset data, the Krimp algorithm [13] is based on the MDL principle and effectively addresses the redundancy issue in descriptive pattern mining. For sequence data, the authors of SeqKrimp [8][9] used the MDL principle to compress the data. This algorithm has two steps. The first step generates sequential patterns as candidates using an existing sequence mining method. The second step greedily checks the candidate set to find the patterns that together minimize the description length. SeqKrimp has two main disadvantages: generating the candidates is expensive, and patterns outside the candidate set have no chance of being selected even if they could reduce the description length.
The authors of GoKrimp [8][9] mine a set of non-redundant sequential patterns that compress the sequence data using the MDL principle. GoKrimp does not generate candidates as SeqKrimp does. Instead, it directly mines compressive patterns by greedily extending a pattern until the extension adds no additional compression benefit. To tame the cost of checking the compression benefit of each extension, GoKrimp uses a dependency test that selects only related events for extending a given pattern.
As GoKrimp does, SQS [18] also directly mines compressed patterns from the sequence dataset. Patterns are constructed iteratively: in each iteration, the pattern that achieves the largest MDL gain among the possible patterns is selected. Note that each iteration requires at least one scan of the sequence dataset.

IV. PROPOSED ALGORITHM
Our method is based on the observation that, in real datasets, most event sequences share common subsequences. To avoid the overhead of duplicated computations, we propose to merge all event sequences in the dataset into one big sequence, abbreviated BigSeq. The construction of BigSeq is one of the main operations of our algorithm: BigSeq must be compact, it must be built efficiently, and at the same time it must preserve all characteristics of the original dataset.
To obtain a compact BigSeq, we need an efficient construction method that reduces its size as much as possible. We now walk through the steps of this construction on the sequence dataset of Table I. First, we select any sequence S in the sequence database D (see Table I) as the initial value of BigSeq; suppose we select the first sequence S_1 ∈ D, so the initial BigSeq is ABCBC. As we will see, some events will later be inserted into the current BigSeq to generate the final BigSeq. Therefore, we assign a temporary index i_j to the j-th event of the current BigSeq, with i_1 < i_2 < ... < i_5, leaving gaps between consecutive indices: we can assume i_j = i_{j-1} + (j-1)·ε with 2 ≤ j ≤ 5 and ε ≥ 1 (for example, i_3 = i_2 + 2ε). After generating the final BigSeq, we will replace each temporary index i_j with its actual value.
Second, for each remaining sequence S′ in D, we compute the longest common subsequence between S′ and BigSeq, namely LCS(S′, BigSeq). We store the positions in BigSeq of each event that belongs to the LCS, and we also store the remaining events of S′ that do not belong to LCS(S′, BigSeq); these remaining events will later be inserted into BigSeq. For example, let S′ be the sixth sequence, S′ = S_6 = CBAC. We have LCS(S′, BigSeq) = LCS(CBAC, ABCBC) = CBC. The positions in BigSeq of the three events C, B, and C of LCS(S′, BigSeq) are i_3, i_4, and i_5 respectively, and we store them. We also store the remaining event A of S′, which does not belong to LCS(S′, BigSeq) and will later be inserted into BigSeq. Each remaining event must be inserted at a correct position in BigSeq. Therefore, we associate with each remaining event e_r a range of positions in BigSeq within which we expect e_r to fall, called the Expected Range of Positions of e_r, namely ERP(e_r). Returning to S′ = S_6 = CBAC, its only remaining event is A. The position of A in S′ falls between the positions of the two events B and C; these two events belong to LCS(S′, BigSeq), and their positions in BigSeq are i_4 and i_5. Therefore, ERP(A) = (i_4, i_5), and the remaining event A should be inserted in BigSeq at a new position between i_4 and i_5. Table II shows the expected range of positions ERP(e_r) for each remaining event e_r.
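The expected ranges of positions of this step can be sketched in Python as follows. This is an illustrative simplification, not the paper's exact procedure: the function name is ours, the LCS events are aligned greedily left-to-right in both S′ and BigSeq (the paper records positions during the LCS computation itself), and BigSeq positions are 0-based, so i_4 and i_5 correspond to indices 3 and 4:

```python
def expected_ranges(s_prime, lcs_str, bigseq):
    """For each remaining event of s_prime (an event not matched by
    the LCS), return (event, lo, hi): the BigSeq positions of the
    nearest LCS events before and after it in s_prime.  lo = -1 and
    hi = len(bigseq) stand for "before the start" / "after the end".
    """
    # Greedily match lcs_str inside bigseq to get its positions there.
    pos_in_big = []
    j = 0
    for i, e in enumerate(bigseq):
        if j < len(lcs_str) and e == lcs_str[j]:
            pos_in_big.append(i)
            j += 1
    # Walk s_prime, matching lcs_str greedily; events that do not
    # match are remaining events, bounded by the neighbouring matches.
    ranges = []
    j = 0
    for e in s_prime:
        if j < len(lcs_str) and e == lcs_str[j]:
            j += 1
        else:
            lo = pos_in_big[j - 1] if j > 0 else -1
            hi = pos_in_big[j] if j < len(pos_in_big) else len(bigseq)
            ranges.append((e, lo, hi))
    return ranges
```

For the example above, expected_ranges("CBAC", "CBC", "ABCBC") yields [("A", 3, 4)], i.e., the remaining event A is expected between i_4 and i_5.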
Finally, we do not insert each remaining event into BigSeq individually. Instead, we cluster the remaining events into compatible event sets, and we insert only one representative event e_rep per compatible event set into BigSeq, at a position p that belongs to the expected range of positions of every event in that compatible event set.

Definition 4.1: Compatible Event Set. A set of remaining events is a compatible event set if (1) all events in the set have the same label, (2) no two events in the set belong to the same event sequence, and (3) the intersection of the expected ranges of positions of the events in the set is nonempty, so that one position p can represent all of them.

Table II reports the initial value of BigSeq (S_1, in the first row), the remaining sequences (S_2, S_3, S_4, S_5, S_6, S_7, S_8, in the first column), the events of each remaining sequence S′ that belong to LCS(S′, BigSeq) (second column), and ERP(e_r) for each remaining event e_r ∉ LCS(S′, BigSeq) (third column).
Note that the underlined events in the first and second columns belong to LCS(S′, BigSeq), and the parameter δ ≥ 1. To distinguish among remaining events with the same label (the events e_r in the third column of Table II), we assign superscripts as follows: A^{km} denotes the m-th remaining event of the sequence k. We now determine the compatible sets of remaining events. A remaining event may belong to more than one compatible event set; in that case, we add it to only one of them. By the definition of a compatible event set, two or more different remaining events that belong to the same event sequence must be added to different compatible event sets. For example, since the two remaining events A^{21} and A^{22} belong to the same event sequence (the second event sequence, S_2), they must be added to two different compatible event sets. Based on Definition 4.1 and ERP(e_r) in Table II, we obtain three compatible sets of remaining events, core = {core_1, core_2, core_3}, where core_1 = {A^{21}, A^{31}, A^{41}, A^{71}}, core_2 = {A^{22}, A^{32}, A^{61}, A^{82}}, and core_3 = {C^{81}}.
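The grouping of remaining events into compatible event sets can be sketched greedily. The tuple layout and the first-fit order below are our own illustrative choices (the paper does not fix a grouping order); ERPs are given as open (lo, hi) index ranges:

```python
def compatible_sets(remaining):
    """Greedily group remaining events into compatible event sets.

    `remaining` is a list of (label, seq_id, lo, hi) tuples, where
    (lo, hi) is the event's expected range of positions in BigSeq.
    Two events are compatible if they share a label, come from
    different sequences, and their ranges intersect.  First-fit
    greedy: each event joins the first set it is compatible with.
    """
    sets = []
    for label, sid, lo, hi in remaining:
        for s in sets:
            if (s["label"] == label and sid not in s["seqs"]
                    and lo < s["hi"] and s["lo"] < hi):
                s["seqs"].add(sid)
                # Narrow the set's range to the common intersection.
                s["lo"] = max(s["lo"], lo)
                s["hi"] = min(s["hi"], hi)
                s["members"].append((label, sid))
                break
        else:
            sets.append({"label": label, "seqs": {sid},
                         "lo": lo, "hi": hi, "members": [(label, sid)]})
    return sets
```

On a small synthetic input, two remaining events with label A from different sequences and overlapping ranges are merged into one set, while a second event from an already-covered sequence opens a new set, mirroring the A^{21} / A^{22} case above.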
Next we discuss the computation of these three compatible sets of remaining events in detail, and how the final BigSeq preserves all characteristics of the original event sequences, with respect to the event sequence database and ERP(e_r) in Tables I and II respectively.
The first compatible set of remaining events is core_1 = {A^{21}, A^{31}, A^{41}, A^{71}}. All events in core_1 have the same label A (Condition 1 in Definition 4.1), and they do not belong to the same sequence: they belong to sequences S_2, S_3, S_4, and S_7 respectively (Condition 2 in Definition 4.1), and the intersection of their expected ranges of positions is nonempty. The second compatible set of remaining events is core_2 = {A^{22}, A^{32}, A^{61}, A^{82}}. Again, core_2 satisfies Conditions 1 and 2 in Definition 4.1: all events in core_2 have the same label A, and they belong to the distinct sequences S_2, S_3, S_6, and S_8 respectively. The third set, core_3 = {C^{81}}, is a singleton.

A. Insertion of the Three Representatives of the Three Compatible Sets in BigSeq
The initial BigSeq with its indices and the final BigSeq with its indices are reported in Table III(a) and Table III(b) respectively. The final BigSeq is ACBCABAC. Note that we insert the three representatives C, A, and A into BigSeq at positions i_1 + 1 (between i_1 and i_2), i_3 + 1 (between i_3 and i_4), and i_4 + 1 (between i_4 and i_5). The size of the final BigSeq is 8 after inserting the three representatives; therefore, the actual indices of the events in the final BigSeq run from 1 to 8. The size of the final BigSeq is given by the following definition.

Definition 4.2: Final BigSeq Size. |final BigSeq| = |initial BigSeq| + |core|, where initial BigSeq is the initial value of BigSeq and |core| is the number of compatible sets of remaining events.
For example, with respect to the event sequence database and the data in Tables I and II respectively, we have |final BigSeq| = |initial BigSeq| + |core| = |S_1| + |core| = 5 + 3 = 8.
From Definition 4.2, to reduce the final BigSeq size, we should reduce the number of compatible sets of remaining events as much as possible.
To preserve all characteristics of the original event sequences in the final BigSeq, we associate with each event e in BigSeq the list of ids of the sequences that contain e, namely e.id_list, as follows. First, since we selected the first sequence as the initial value of BigSeq, we add 1 (the id of the first sequence) to e.id_list for each event e ∈ initial BigSeq = S_1 [see Table IV(a)]. Second, consider the case that an event e ∈ LCS(S′, BigSeq), where S′ is a remaining sequence (i.e., e ∈ S′ and e ∈ BigSeq); in this case, we add the id of S′ to e.id_list [see Table IV(b)]. Finally, we have the three representative events of the three compatible sets of remaining events. As mentioned before, we inserted the three representatives A, A, and C into BigSeq at positions i_3 + 1, i_4 + 1, and i_1 + 1 respectively to generate the final BigSeq [Table IV(c) shows the insertion of the three representatives of the three compatible sets in BigSeq with their id lists].
For each representative event e_rep of a compatible set of remaining events core_k, we add to e_rep.id_list the id of the event sequence that contains the remaining event e_r, for every e_r ∈ core_k, with k = 1, 2, 3 [see Table IV(c)].
The next algorithm outlines the BigSeq construction with sequence id lists.

Algorithm 1: BigSeq Construction with Sequence Id Lists
Input: event sequence database D
Output: the final BigSeq with e.id_list for each event e ∈ BigSeq
1. Select a sequence S_1 ∈ D as the initial value of BigSeq
2. Add the id of S_1 to e.id_list for each event e ∈ BigSeq
3. ERP = ∅
4. for each remaining sequence S′ ∈ D do
5.    lcs = LCS(S′, BigSeq)
6.    Store the positions in BigSeq of the events of lcs
7.    for each event e ∈ lcs do // e ∈ S′ and e ∈ BigSeq
8.       Add the id of the sequence S′ that contains the event e to e.id_list in BigSeq
9.    end for
10.   for each event e_r ∈ S′ with e_r ∉ lcs do // e_r is a remaining event
11.      Compute ERP(e_r)
12.      ERP = ERP ∪ ERP(e_r)
13.   end for
14. end for
15. Find the compatible sets of remaining events, core, based on Definition 4.1 and ERP
16. for each compatible set core_k ∈ core do
17.    Insert the representative event e_rep of core_k into BigSeq at a position p ∈ ERP(e′) ∀ e′ ∈ core_k
18.    Add to e_rep.id_list the id of the event sequence that contains e_r, ∀ e_r ∈ core_k
19. end for
20. return BigSeq // the final BigSeq with e.id_list for each event e ∈ BigSeq

V. COMPRESSING THE EVENT SEQUENCE DATABASE USING THE BigSeq METHOD
The objective of this paper is to compress the event sequence database efficiently while preserving all characteristics of the original database. In other words, we compress the event sequence database into a compact BigSeq with sequence id lists. However, when the event sequence database is large, the id list of each event in the corresponding BigSeq becomes large. To confine the size of these id lists, we represent e.id_list of each event e ∈ BigSeq as a set of bit-vectors, B(e) = {B_1, B_2, ..., B_m}, where each B_i is a bit-vector of length 8 (i.e., each B_i occupies 1 byte in memory). If n is the maximum id in e.id_list, then m = |B(e)| = ⌈n/8⌉. Each position j ∈ {1, ..., 8} of B_i corresponds to the event sequence S_{8×(i−1)+j} ∈ D: the bit at position j of B_i represents the presence or absence of the event e ∈ BigSeq in the event sequence S_{8×(i−1)+j}. See the next example.
Example 5.1: Given the BigSeq with id lists of Table IV(c), its corresponding BigSeq with bit-vectors is reported in Table V.
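This encoding can be sketched in Python as follows. The function name and the MSB-first bit order within each byte are our own assumptions; the paper only fixes one byte per group of eight sequence ids:

```python
def id_list_to_bitvectors(id_list, n):
    """Encode a 1-based sequence-id list as ceil(n/8) one-byte values.

    Bit j of B_i (counting j = 0..7 from the most significant bit)
    is set iff the event occurs in sequence S_{8*i + j + 1}.
    n is the number of sequences in the database.
    """
    m = (n + 7) // 8            # ceil(n / 8) bytes
    bytes_ = [0] * m
    for sid in id_list:
        i, j = divmod(sid - 1, 8)   # which byte, which bit
        bytes_[i] |= 1 << (7 - j)   # MSB-first within the byte
    return bytes_
```

For instance, with n = 8 sequences, an event whose id list is {1, 2, 5} is encoded as the single byte 11001000 (decimal 200).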

A. Compression Benefit
Suppose each event e occupies 1 byte in memory; then the size of the original event sequence database D (Table I) is 34 bytes, since D contains 34 events. Recall that each B_i also occupies 1 byte; therefore, the size of the BigSeq with bit-vectors (Table V) is 8 event bytes plus 8 bit-vector bytes (one per event, since n = 8 gives |B(e)| = 1), i.e., 8 + 8 = 16 bytes. We use the compression ratio to measure how well the data is compressed: it is calculated by dividing the data size before compression by the size after compression. In the above example, the compression ratio is 34/16 = 2.125. Equivalently, the space saving is (1 − compressed size / uncompressed size) × 100 = (1 − 16/34) × 100 ≈ 52.9%. We further apply an optimization based on the observation that each row of the bit-vectors contains many consecutive zeros, which is clearly wasteful. Therefore, we compress these runs of consecutive zeros efficiently as follows. Given an array of bits (0s and 1s), Bit_Arr, and a parameter n, the output is the same as the input except for runs of consecutive zeros. There may be several runs of consecutive zeros in Bit_Arr; for each run CZ, we do the following. If |CZ| ≤ n + 2, we leave it unchanged. Otherwise, we compress CZ into a compressed representation of n + 2 bits: the first bit is 0 and the second bit is 1 (together indicating a compressed run), and the remaining n bits encode how many consecutive zeros were repeated.
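The zero-run scheme just described can be sketched as follows, operating on a bit string for readability (the paper works on the raw bit-vectors). Handling runs longer than 2^n − 1 and disambiguating "01" prefixes that occur naturally on decoding are left out of this sketch:

```python
def compress_zero_runs(bits, n):
    """Run-length-compress long runs of zeros in a bit string.

    A run of zeros longer than n + 2 bits is replaced by n + 2 bits:
    a 0, a 1, then the run length written in n binary digits.
    Shorter runs are copied unchanged, as are all 1 bits.
    """
    out = []
    i = 0
    while i < len(bits):
        if bits[i] == "1":
            out.append("1")
            i += 1
            continue
        # Measure the run of zeros starting at i.
        j = i
        while j < len(bits) and bits[j] == "0":
            j += 1
        run = j - i
        if run <= n + 2:
            out.append("0" * run)               # not worth compressing
        else:
            out.append("01" + format(run, "0{}b".format(n)))
        i = j
    return "".join(out)
```

With n = 4, a run of ten zeros (12 bits including its delimiting 1s) shrinks to the 6-bit code 011010, since 1010 is 10 in binary; a run of three zeros is left untouched.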
In the experimental results section, we show that the BigSeq method achieves a better compression ratio than the state-of-the-art algorithm GoKrimp on several real datasets.

VI. EXPERIMENTAL EVALUATION
This section reports the results of experiments on several real datasets. We compare the performance of our proposed method, BigSeq, with the GoKrimp algorithm [8][9]. We exclude the SeqKrimp and SQS algorithms from this experiment since GoKrimp outperforms them by one to two orders of magnitude. BigSeq is implemented in standard C++ with STL support and compiled with GNU GCC. Experiments were run on a laptop with an Intel i3 2.4 GHz processor and 8 GB of memory running Linux.

A. Datasets
The experimental evaluation is performed on a group of real datasets as follows. We used five real datasets, namely msnbc [10], Gene [16], TCAS [17][4], Activity [1], and JBoss [5]. Their characteristics are summarized in Table VI, where |D| is the number of sequences, |E| is the number of distinct events, and min_L, max_L, and avg_L denote the minimum, maximum, and average sequence length respectively.

B. Effect of Optimization
In this experiment, we show the effect of the optimization that compresses runs of consecutive zeros, denoted Opt, on the compression ratio.

C. Performance of BigSeq against GoKrimp
From the previous experiment, BigSeq with Opt achieves the best compression ratio. Therefore, in the following experiments we use BigSeq with Opt and refer to it simply as BigSeq.
The proposed method BigSeq is evaluated according to the following criteria:
• Compression Ratio: To measure how well the dataset is compressed using BigSeq.
• Total Response Time: To measure the efficiency of BigSeq.
• The Number of Patterns: The number of detected patterns used for compression.

1) Compression Ratio: Table VIII reports the compression ratio of the two algorithms on the five datasets. Recall that the larger the compression ratio, the better the compression. The BigSeq algorithm shows a better compression ratio on all datasets; for example, on the Gene dataset, the compression ratio of BigSeq is 2.428 while that of GoKrimp is 1.251.

2) Total Response Time (Sec): Table IX reports the total response time (in seconds) of the two algorithms on the five datasets. BigSeq has the best execution time on all datasets: on the msnbc, Gene, TCAS, Activity, and JBoss datasets, BigSeq outperforms GoKrimp by more than two orders of magnitude, more than one order of magnitude, more than three orders of magnitude, approximately a factor of three, and more than one order of magnitude respectively.

3) The Number of Patterns: Table X reports the number of patterns used for compression by the two algorithms on the five datasets. Note that BigSeq uses only one pattern for every dataset: the compact BigSeq itself.

VII. CONCLUSION
In this paper, we focused on summarizing event sequence datasets. Existing methods summarize an event sequence dataset by mining significant patterns that compress the dataset well. In contrast, our proposed method, BigSeq, summarizes the event sequence dataset by merging the event sequences into one compact big sequence. The construction of the compact big sequence is based on the longest common subsequence and a novel definition of the compatible event set. Our compact big sequence preserves all characteristics of the original event sequences. Experimental results show that the BigSeq method achieves better performance than state-of-the-art methods such as GoKrimp in terms of compression ratio, total response time, and number of detected patterns. As future work, we plan to adapt the BigSeq method for mining frequent, closed, and maximal patterns in event sequence datasets.