Word-Based Grammars for PPM

The Prediction by Partial Matching (PPM) compression algorithm is considered one of the most efficient methods for compressing natural language text. Despite the advances of the PPM method in predicting upcoming symbols or words for the English language, more research is required to devise better compression methods for other languages such as Arabic, due, for example, to the rich morphological nature of Arabic text, where a word can take many different forms. In this paper, we propose a new method that achieves the best compression rates not only for Arabic text but also for other languages that use Arabic script in their writing system, such as Persian. Our word-based method constructs a context-free grammar (CFG) for the text, and this grammar is then encoded using PPM to achieve excellent compression rates.

Keywords—Component; context-free grammar (CFG); grammar-based; word-based; preprocessing; Prediction by Partial Matching (PPM); encoding


I. INTRODUCTION
The Prediction by Partial Matching (PPM) compression algorithm is one of the most effective kinds of statistical compression. It was first described by Cleary and Witten in 1984 [1], and many variants of the basic algorithm have since appeared, such as PPMA and PPMB [1], PPMC [2], PPMD [3], PPM* [4], PPMZ [5] and PPMii [6]. Prediction in PPM depends on a bounded number of previous characters or symbols, effectively using a Markov-based approach. Despite its cost in terms of memory and execution speed, PPM usually attains better compression rates than other well-known compression methods.
In PPM, to predict the next character or symbol, models of different orders are used, starting from the highest order and working down to the lowest. An escape probability estimates the likelihood that a new symbol appears in the context [1], [2], and if an escape is encoded, the algorithm backs off to a lower order model. The "full exclusions" mechanism [1] significantly improves compression: when an escape has occurred, the symbols already predicted by the higher order are excluded from the lower order prediction, since the escape shows that none of them was the symbol being encoded [17]. Experimental results show that not using full exclusions speeds up execution but reduces compression.
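The back-off procedure described above can be sketched in a few lines of Python. This is a minimal, non-adaptive illustration under stated assumptions, not the implementation used in this paper: it assumes the PPMD estimate, in which a symbol seen c times in a context with n total occurrences and t distinct symbols has probability (2c - 1)/2n and the escape has probability t/2n, and it falls back to a uniform order -1 model. The function names `train` and `probability` are hypothetical.

```python
from collections import Counter, defaultdict
from fractions import Fraction


def train(seq, max_order):
    """Count which symbols follow every context of length 0..max_order."""
    counts = defaultdict(Counter)
    for i, sym in enumerate(seq):
        for k in range(max_order + 1):
            if i - k >= 0:
                counts[tuple(seq[i - k:i])][sym] += 1
    return counts


def probability(counts, context, sym, alphabet, max_order, exclusions=True):
    """Probability of `sym` after `context`, backing off from high to low order."""
    prob = Fraction(1)
    excluded = set()
    for k in range(min(max_order, len(context)), -1, -1):
        ctx = tuple(context[len(context) - k:])
        seen = counts.get(ctx)
        if not seen:
            continue  # context never observed: move to the next lower order
        active = {s: c for s, c in seen.items() if s not in excluded}
        if not active:
            continue
        n = sum(active.values())  # total occurrences in this context
        t = len(active)           # distinct symbols in this context
        if sym in active:
            return prob * Fraction(2 * active[sym] - 1, 2 * n)
        prob *= Fraction(t, 2 * n)  # encode an escape and back off
        if exclusions:
            excluded.update(active)  # these symbols cannot be the answer
    # Order -1: every symbol that is still possible is equiprobable.
    remaining = [s for s in alphabet if s not in excluded]
    return prob * Fraction(1, len(remaining))


counts = train("abracadabra", max_order=2)
alphabet = sorted(set("abracadabra"))
with_excl = probability(counts, "ra", "d", alphabet, 2)
without_excl = probability(counts, "ra", "d", alphabet, 2, exclusions=False)
print(with_excl, without_excl)
```

In this toy example the symbol "d" is not predicted after "ra", so an escape is encoded and the order 1 context "a" is used; excluding the already rejected symbol "c" from that context raises the probability assigned to "d", which is why full exclusions improve compression at the cost of extra bookkeeping.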
However, when a PPM approach is applied to words rather than characters, it is not clear what the most effective method for encoding the text is. This is because there are issues of how to encode the spaces and punctuation along with the text, how to deal with capitalized words, whether to treat digit sequences differently, how to deal with the much larger alphabet when using full exclusions, and so on. This is compounded further for certain languages such as Arabic, whose rich morphological structure means that the same word can take many different forms, which potentially presents further difficulties for word-based compression compared to languages such as English.
As an illustration, the lists below in Table 1 show the most common words in each of the examined texts. They are based on an analysis of the Brown Corpus for American English [9], the LOB Corpus for British English [10], the BACC [11] and CCA [12] Corpora for Arabic text, the Hamshahri corpus for Persian text [13] and the CEG corpus for Welsh text [16].
Substitution of these words using our context-free grammar scheme and standard PPM can significantly improve overall compression as shown below. For example, natural languages contain common sequences of words that often repeat in the same order, such as "the", "of" and "and" in English, and "من", "في" and so on for the Arabic language in the BACC corpus. From Table 1, the most common word for both American and British English is found to be "the". However, for these corpora, if one treats capitalized words as being distinct (that is, "the" is treated as distinct from "The"), we find that the word "The" also appears in the top 20 ranked words, but at different ranks (12 for the Brown Corpus versus 16 for the LOB Corpus). In contrast, the word "had" appears with the same rank for both corpora. Certain words, such as "from" and "at", appear in the list for one corpus but not for the other.

TABLE I. THE TOP COMMON 20 WORDS FOR THE BROWN, LOB, BACC, ...

For Arabic text, the most common word for both the BACC and CCA Corpora is found to be "في" (in). Nevertheless, we find that the word "ان" (that) also appears in the top 20 ranked words, but at different ranks (4 for the CCA Corpus versus 6 for the BACC Corpus). In contrast, the word "من" (from) appears with the same rank for both corpora. Certain words such as "التي" (which) and "له" (for him) appear in the list for one corpus but not for the other. For Persian text in the Hamshahri Corpus, even though it uses Arabic script, the top 20 ranked words are noticeably different due to the difference between these two languages.
From these lists, it is clear even just from examining the top 20 ranking words that there are important differences, and therefore word-based compression schemes have to adapt directly to the text being compressed in an online manner (as PPM does) rather than use dictionaries created from general sources. Another factor is that since the most frequent words represent a significant proportion of the text, adaptive word-based schemes can often lead to improved compression for many languages. An added advantage of such schemes is that far fewer symbols need to be encoded (for example, for English, there are on average approximately five times fewer word symbols than character symbols). However, finding the most effective word-based compression is still an open problem, with word-based schemes under-researched compared to character-based schemes. The comparison between the effectiveness of word-based schemes and that of character-based and parts-of-speech (tags) based ones also provides an interesting tool for performing further linguistic analysis [8]. The main contribution of the work described in this paper is an improved word-based compression method for PPM. The improvement comes from parsing the text to construct a word-based context-free grammar (CFG), which is then compressed using PPM.
The rest of the paper is organized as follows. Previous work is discussed first. Then our new approach is discussed in the next section. We discuss experimental results for various natural language texts in order to evaluate how well the new scheme performs compared to other well-known methods. The summary and conclusions are presented in the final section.

II. PREVIOUS WORK
As stated, standard PPM word-based models predict the forthcoming symbol starting from the highest order context; when the upcoming symbol has not appeared in this context, an escape symbol is encoded and a lower order context is used. A number of methods have been used to estimate the probability of these escape symbols [7], [8].
Experiments indicate that the X1 method is the best performing for English text in most cases [8]. This method is given by the formula:

 
Here, t_1 denotes the number of symbols seen previously only once in the context and T_d is the frequency with which the symbol occurs in the context. Therefore, this method estimates the escape symbol probability in proportion to the number of words that have appeared only once in the text.

TABLE II. SOME MODELS FOR PREDICTING CHARACTERS AND WORDS

Experiments for the English language show that the word-based models in Table 2 give the best performance among the models considered [8].
Model C|C⁵ is a PPM character model of order five that predicts the probability of character symbols and is used as a compression baseline. In this model, the probability of a text string S of m characters is given by:

$$p(S) = \prod_{i=1}^{m} p(c_i \mid c_{i-5}\, c_{i-4}\, c_{i-3}\, c_{i-2}\, c_{i-1})$$

where c_i denotes the i-th character of S, and the preceding five characters in the text are used to estimate the probability of the forthcoming symbol.
The estimate of the probability in the previous formula depends on the escape method (in Table 2, the symbol → denotes an escape). In character-based models, if the highest order fails to predict the forthcoming symbol, an escape is encoded and prediction continues with the next highest order.
The second model, W|W, is a PPM order one word-based model that predicts the probability of word symbols. In this model, the estimation of the probability for the forthcoming word depends on the previous word in the text, as represented by the following formula for the probability of a text string S of n words:

$$p(S) = \prod_{i=1}^{n} p(w_i \mid w_{i-1})$$

where w_i denotes the i-th word of S and p denotes the probability of the symbols in the sequence of the text S based on words. If the word is not predicted by this model, then an escape is encoded down to the order 0 model. If the word still has not been seen in this context, then a further escape is encoded, followed by each character in the word being encoded separately using the standard PPM character-based model.

III. WORD-BASED GRAMMARS FOR PPM (GRW-PPM)
A new approach based on word-based context free grammars (CFGs) for compressing text files is presented here. This algorithm, which we call GRW-PPM (which is short for grammar word-based pre-processing for PPM) uses both CFGs and PPM as the basis of a universal general-purpose adaptive compression method for text files.
In our approach, we essentially parse words, digits, spaces and punctuation in the source file to first generate a grammar with rules and terminal and non-terminal symbols representing each of these text elements. We then substitute each occurrence of one of these text elements in the source text with the single unique non-terminal symbol specified by its rule in the grammar. This is done during the pre-processing phase prior to the PPM compression phase, which is applied separately to the sequences of non-terminal symbols for words, for digits, and for spaces and punctuation.

(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 8, No. 10, 2017, www.ijacsa.thesai.org
Our method replaces sequences of words (n-grams) in the text as they are processed from beginning to end in a single pre-processing pass. The PPM algorithm is used as the encoder once the sequences have been replaced. Unlike PPM, our method is off-line during the phase which generates the grammar.
Our approach adapts the W|W word-based method and the character n-graph replacement pre-processing approach of Teahan [8] by using an off-line technique to generate the list of word n-grams first from the source file being compressed. However, our approach is considered within a grammar-based context instead. The main difference with the prior word-based schemes (such as W|W) is the use of PPM to encode the sequence of word symbols directly without the need to escape to a separate character-level encoding and also treatment of digits as word symbols (see below).
The grammar in GRW-PPM shares the same characteristic as Sequitur by Nevill-Manning and Witten [14] and GR-PPM [15], which is that no pair of symbols appears in the grammar more than once. This property ensures that every n-gram in the grammar is unique, a property called non-terminal uniqueness, using the terminology proposed by Nevill-Manning and Witten. To make sure that each rule in the grammar is useful, the second property, referred to as rule utility, is that every rule in the grammar is used more than once in the corrected text sequence.

Fig. 1 shows the whole process of GRW-PPM. First, the original text is parsed and the word, digit and space/punctuation tokens are extracted; then the CFG is generated by replacing them in the text wherever they occur with the non-terminal symbols defined by their rules in the grammar. After the rules have been produced, the grammar is encoded by using PPMD, and the resulting compressed text is then sent to the receiver. The receiver then decodes the grammar by using PPMD to decompress the compressed file that was sent. The reverse mapping is then performed by using the decoded grammar to regenerate the original source text.

Table 3 illustrates the process of GRW-PPM using a sentence referring to the song by Manfred Mann: "The song 'Do Wah Diddy Diddy Dum Diddy Do' was recorded on 11 June 1964 and released on 10 July". First, the original text is parsed from left to right and new non-terminal word and digit symbols (S1 S2 S3 S4 S5 S5 S6 S5 … S12 S9 D3 S13) are substituted for each unique n-gram (defined as being separated by the intervening space and punctuation symbols). For this example (and for the experiments described below), we use single words (unigrams), although the method works in a similar way for word bigrams and trigrams.
Referring to Table 3, we replace the unigram "The" with non-terminal symbol S1, unigram "song" with non-terminal symbol S2, unigram "Do" with non-terminal symbol S3 and so on. We use bullet points for spaces to make them visible. Spaces (white-space) and punctuation define the word boundaries (i.e. each word is made up of sequences of anything that is not whitespace or punctuation).

Grammar:
S → S1 S2 S3 S4 S5 S5 S6 S5 S3 S7 S8 S9 SD S10 SD S11 S12 S9 SD S13
V → S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13
S1 → "The"
S2 → "song"
S3 → "Do"
S4 → "Wah"
S5 → "Diddy"
S6 → "Dum"
S7 → "was"
S8 → "recorded"
S9 → "on"
S10 → "June"
S11 → "and"
S12 → "released"
S13 → "July"
D1 → "11"
D2 → "1964"
D3 → "10"
P1 → "•"
P2 → "•'"
P3 → "'•"
P4 → "."

Table 4 shows the same process for a sample Arabic text (Fig. 2) which translates into English as follows: "The number of shares traded in the market, "Saudi" were more than 277 thousand shares, and the number of transactions were more than 132 thousand transactions." However, in this case the n-grams are generated from right to left instead. Each unique Arabic unigram has a non-terminal symbol associated with it. For instance, the words "وبلغ", "عدد" and "الأسهم" are replaced by non-terminal symbols S1 to S3, respectively.
In the grammar examples, the S rule is used to represent the sequence of word and digit symbols. Separate rules (S1, S2, S3, …) are used, one for each word, to specify each symbol's contents directly using a terminal string (denoted by characters surrounded by quotation marks). The V rule enumerates each of these words in order; it is used to represent the vocabulary (the sequence of unique words as they occur in the text). Each digit sequence is encoded within the S sequence by using a special symbol to indicate the positions of the digits in the sequence (as represented by SD in the above examples). The actual contents of each digit symbol are specified by the D rules and encoded separately from the word and digit symbols. We also process spaces and any punctuation characters in order to be able to fully decode the original text. These are represented by the P rules for the grammars in the above examples and are similarly encoded separately from the word and digit unigram symbols. Moreover, the grammar is transmitted to the receiver once it has been constructed, after all unigrams in the original text have been substituted with their non-terminal symbols.
The grammar represents a complete description of the text and therefore it is possible to devise a lossless text compression scheme by directly encoding it in some manner since it is possible for the decoder to regenerate the complete source text losslessly once the grammar has been decoded.
TABLE IV. ANOTHER EXAMPLE GRAMMAR GENERATED BY GRW-PPM FOR A SAMPLE ARABIC TEXT
Sequence:

Grammar:
S → S1 S2 S3 S4 S5 S6 S7 S3 S7 S8 SD S9 S10 S1 S2 S11 S7 S8 SD S9 S12
V → S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12

As stated, we have found that one effective means of encoding the grammar is to use PPM. Specifically, the grammar is encoded by using PPMD to separately encode the four main elements (words, vocabulary, digits and spaces/punctuation, as represented by the S, V, D and P rules). For Rule S, we can encode the sequence of symbol numbers or letters that appear in the rule. For example, in Table 3, the sequence of symbol numbers/letters for Rule S is as follows: 1 2 3 4 5 5 6 5 3 7 8 9 D 10 D 11 12 9 D 13. This represents the sequence of id numbers assigned to each unique word, with id numbers starting from 1 and incrementing by one whenever a new word is encountered. The letter D indicates when a digit sequence has occurred. Clearly, the sequence for rule S will be highly repetitive for long sequences of natural language text because of the presence of repeated words and frequent function words (such as "the" and "and" for English and "من" and "في" for Arabic as shown in Table 1). More specifically, we have found PPMD to be very effective at encoding this sequence.
However, unlike W|W (which uses similar PPM-like methods to encode word symbols in this manner), our method simply uses PPMD with a fixed maximum alphabet size (since this is known when the grammar has been fully constructed for the whole text). Also, our method does not need to encode an escape down to a separate character-level as W|W does in order to encode novel words when they occur.
Instead, it uses the standard PPMD encoding mechanism (where a novel symbol will be encoded using a default order -1 model where all symbols are equiprobable).
For practical purposes, rule V and rules S1, S2, S3, … can simply be represented as a string of text that contains all the unique words as they appear in the source text, one after another, with a separator (such as a space character) used to indicate the end of the previous word and the beginning of the next one. Similarly, we can use the same encoding technique for the digit sequences for rule D and rules D1, D2, D3, … and for the spaces and punctuation for rule P and rules P1, P2, P3, …. That is, both the digits and punctuation can be encoded effectively by using PPMD to encode one text string that contains all the unique digit sequences and another text string that contains the unique space and punctuation sequences respectively. A space character can be used as a separator for the digits, but for the punctuation, a different separator is needed. We use the letter "W" as the separator in this case to mark where the words are.
As an illustration, Table 5 presents the symbols or text that is being encoded for the four elements (symbols, vocabulary, digits, spaces and punctuation) for the beginning of the Brown corpus. All are encoded directly by PPMD as text except for the Symbols element which is treated as a sequence of numbers instead.
The decompression process first uses PPMD to decode the four separate elements and then re-constructs the full grammar from them. During the subsequent regeneration phase, the grammar is then used to exactly regenerate the original source text character for character (i.e. the method is completely lossless). Whenever a previously unseen symbol is encountered as the sequence specified by the S rule is being processed, the current word is read from the sequence specified by the V rule and then the position is moved along to the next word.

Brown Corpus (text at the start of the corpus):
The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced "no evidence" that any irregularities took place.

The P rule is used to insert the punctuation between the word and digit symbols as they are encountered in the S rule. Whenever a digit is signified by the SD symbol for this rule, the current digit symbol is read from the sequence specified by the D rule, which is then inserted into the decoded output sequence, and the position then moves along to the next digit symbol.
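The regeneration phase described above can be sketched as follows. This is a minimal illustration that assumes the four streams have already been decoded by PPMD and are available as plain strings, with symbols, vocabulary words and digit sequences separated by single spaces and with the letter W marking word positions in the punctuation stream; the function name `regenerate` and the short example streams are hypothetical.

```python
def regenerate(symbols, vocabulary, digits, punctuation):
    """Rebuild the source text from the S, V, D and P streams."""
    symbol_iter = iter(symbols.split())
    vocab_iter = iter(vocabulary.split(" "))
    digit_iter = iter(digits.split(" "))
    words = {}  # symbol id -> word, filled in as new ids are encountered
    out = []
    for ch in punctuation:
        if ch != "W":
            out.append(ch)  # spaces and punctuation are copied as they are
            continue
        sym = next(symbol_iter)
        if sym == "D":
            # Digit marker: read the current digit sequence and move along.
            out.append(next(digit_iter))
        else:
            if sym not in words:
                # Previously unseen symbol: read the next vocabulary word.
                words[sym] = next(vocab_iter)
            out.append(words[sym])
    return "".join(out)


# Illustrative streams for the text "on 11 June and on 10 July."
text = regenerate(
    symbols="1 D 2 3 1 D 4",
    vocabulary="on June and July",
    digits="11 10",
    punctuation="W W W W W W W.",
)
print(text)
```

Because W is a letter, it can never occur inside a run of spaces and punctuation, so it is unambiguous as a placeholder in that stream.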
Algorithm 1 summarizes the algorithm using pseudo-code. Lines 1 through 15 are for the n-gram tokenizer. Line 3 starts the for loop that reads the n-grams in the input file. Lines 4 through 9 check if the n-gram is a word; if it is, they print the n-gram to the Grammar file, assign it an id number (with ids increasing by one for each new unique n-gram that is found) and also print a W to the Spaces & Punctuation file. Lines 10 through 13 check if the n-gram is a digit sequence; if it is, they add it to the Digits file and print a W to the Spaces & Punctuation file. Lines 14 and 15 check if the n-gram is punctuation or space; if so, it is added to the Spaces & Punctuation file. Line 16 compresses the final text of the four files by using PPMD.
A further improvement of our approach, both in terms of compression and execution speed, can be gained by further processing the files in the following manner. The main disadvantage of the Symbols file is that it consists of many singletons that occur only once in the text and doubletons that occur only twice [18]. Singletons and doubletons are detrimental to the encoding efficiency because they do not give any useful reference information [19]. In addition, singletons incur an unnecessary extra cost in our scheme because their symbol numbers are unique and therefore cause the alphabet size to be incremented by 1 each time they occur (which is frequently, due to the Zipf's Law-like nature of natural language text). As a result, the alphabet size can be substantially higher when these are present. A large alphabet for PPM is undesirable when using the full exclusions mechanism [1] that PPM uses for its encoding, as it substantially slows down execution due to the need to exclude symbols already seen in the higher orders from lower order predictions.
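The tokenizer stage summarized above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a transcription of Algorithm 1: it handles unigrams only, treats a word as any run of characters matched by the regular expression `\w+`, returns the four streams as strings instead of writing files, and leaves the final PPMD compression step out. The function name `build_streams` is hypothetical.

```python
import re

# A token is either a run of word characters or a run of anything else.
TOKEN = re.compile(r"(?P<word>\w+)|(?P<gap>\W+)")


def build_streams(text):
    """Split text into the Symbols, Vocabulary, Digits and Punctuation streams."""
    ids = {}  # word -> id number, assigned in order of first appearance
    symbols, vocabulary, digits, punctuation = [], [], [], []
    for match in TOKEN.finditer(text):
        token = match.group()
        if match.lastgroup == "gap":
            punctuation.append(token)   # spaces and punctuation kept verbatim
        elif token.isdigit():
            digits.append(token)        # digit contents go to their own stream
            symbols.append("D")         # and D marks their position
            punctuation.append("W")
        else:
            if token not in ids:
                ids[token] = len(ids) + 1
                vocabulary.append(token)
            symbols.append(str(ids[token]))
            punctuation.append("W")     # W marks where a word or digit occurs
    return (" ".join(symbols), " ".join(vocabulary),
            " ".join(digits), "".join(punctuation))


sentence = ("The song 'Do Wah Diddy Diddy Dum Diddy Do' was recorded "
            "on 11 June 1964 and released on 10 July.")
s, v, d, p = build_streams(sentence)
print(s)  # 1 2 3 4 5 5 6 5 3 7 8 9 D 10 D 11 12 9 D 13
```

For the example sentence of Table 3 this reproduces the symbol sequence quoted for Rule S; each of the four returned strings would then be passed to a PPMD encoder of the appropriate order.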
In order to overcome these problems and therefore improve our new method, we process the Symbols file to replace all singletons with the same special symbol wherever they occur. For example, for the Symbols stream "1 6 7 6 7 7 4 5" there are three singletons: 1, 4 and 5. These singletons get replaced by a special symbol (*, say) and the Symbols sequence being encoded becomes "* 6 7 6 7 7 * *". Each singleton can be readily decoded once the special symbol is encountered in the Symbols stream, which signals to the decoder to read the characters for the word from the next set of characters in the Vocabulary stream up until the next word separator character. For our example, let's say that the characters in the Vocabulary stream are "one six seven four five". When replacing just singletons in the Symbols stream, there is no need to change this Vocabulary stream, since singletons only occur once and the decoder will therefore have all the necessary information to decode each word. The only effect is that the Symbols stream becomes slightly more compressible with a much smaller alphabet, which significantly speeds up compression when performing full exclusions as shown below.

We also have an option to replace doubletons and tripletons (and so on) wherever they occur in the Symbols file if we wish. However, when replacing non-singletons in this way, there is no way to decode the characters when the word is being replaced the second time or subsequent times (for tripletons etc.), so a simple expedient is to repeat the word character for character in the Vocabulary stream whenever it occurs again. Using the previous example again, if we were to replace singletons and doubletons (but not tripletons), then the Symbols sequence would now be encoded as "* * 7 * 7 7 * *" since the symbol 6 appears twice (i.e. it is a doubleton) but symbol 7 appears three times (i.e. it is not a singleton or doubleton).
In the Vocabulary stream in this case, the characters for symbol 6 would appear twice, i.e. it would now become "one six seven six four five" since the word "six" is a doubleton and therefore appears again in this sequence. Clearly, the size of the Vocabulary stream now will grow because of the presence of the repeated words and this can affect the overall compression, but this is offset by the significantly faster processing since the alphabet size in the Symbols stream is much smaller.
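The replacement of rare symbols and the matching decoding step can be sketched as follows. This is a minimal illustration using the example stream above; the function names `replace_rare` and `restore`, the `max_count` parameter and the use of `*` as the special symbol are assumptions made for the sketch.

```python
from collections import Counter


def replace_rare(symbols, words, max_count=1, marker="*"):
    """Replace symbols occurring at most max_count times by a marker.

    symbols: list of symbol ids in text order; words: dict id -> word.
    Returns the new Symbols sequence and the matching Vocabulary sequence.
    """
    freq = Counter(symbols)
    new_symbols, vocabulary, seen = [], [], set()
    for sym in symbols:
        if freq[sym] <= max_count:
            new_symbols.append(marker)
            vocabulary.append(words[sym])  # spelled out at every occurrence
        else:
            new_symbols.append(sym)
            if sym not in seen:            # frequent words are listed once
                seen.add(sym)
                vocabulary.append(words[sym])
    return new_symbols, vocabulary


def restore(symbols, vocabulary, marker="*"):
    """Recover the word sequence from the Symbols and Vocabulary sequences."""
    vocab_iter = iter(vocabulary)
    words, out = {}, []
    for sym in symbols:
        if sym == marker:
            out.append(next(vocab_iter))
        else:
            if sym not in words:
                words[sym] = next(vocab_iter)
            out.append(words[sym])
    return out


stream = [1, 6, 7, 6, 7, 7, 4, 5]
names = {1: "one", 4: "four", 5: "five", 6: "six", 7: "seven"}
print(replace_rare(stream, names, max_count=1))
print(replace_rare(stream, names, max_count=2))
```

With `max_count=1` the Vocabulary sequence is unchanged, while with `max_count=2` the word "six" appears in it twice, which is the trade-off between Vocabulary size and Symbols alphabet size discussed above.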
In the experimental results below, we use the following labels for the variants of our algorithm: GRW-PPM for our standard algorithm; GRW1-PPM for when singletons are replaced by the special symbol; GRW2-PPM for when both singletons and doubletons are replaced; GRW3-PPM for when all the singletons, doubletons and tripletons are replaced; and GRW4-PPM for when all the singletons, doubletons, tripletons and quadrupletons are replaced.

IV. EXPERIMENTAL RESULTS
This section discusses experimental results using GRW-PPM and its variants described above for compression of various text files. We compare our new method with other compression schemes. Also, we discuss in this section the encoding execution times for GRW-PPM with and without using the full exclusions mechanism that PPM uses for its encoding.
In this experiment, the GRW-PPM encoding is divided into four parts: the Grammar, the Symbols, the Digits, and the Spaces and Punctuation. Order 5 PPMD is used for the Grammar, order 1 PPMD for the Symbols, order 4 PPMD for the Digits and order 4 PPMD for the Spaces and Punctuation. Experiments showed these orders were the most effective at compressing the different text elements. Table 6 shows the compression ratio for the four parts. The compression ratio of each part is calculated as its compressed output size in bytes multiplied by 8 and divided by the original input file size, in order to determine the contribution each part makes to the overall encoding cost. As shown in the table, the Digits part has the smallest compression rate for the different languages. Also, the compression rates for the Grammar and the Spaces and Punctuation parts are small compared to the Symbols part for the Brown, LOB, CEG, Hamshahri and BACC corpora.
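The per-part calculation described above amounts to the following short sketch. The function names are hypothetical, and the sizes used in the example are made-up values chosen only to show the arithmetic; they are not results from this paper.

```python
def compression_ratio(compressed_bytes, original_size):
    """Compressed size in bits divided by the original input file size."""
    return compressed_bytes * 8 / original_size


def per_part_ratios(part_sizes, original_size):
    """Contribution of each encoded part, plus the overall total."""
    ratios = {name: compression_ratio(size, original_size)
              for name, size in part_sizes.items()}
    ratios["Total"] = sum(ratios.values())
    return ratios


# Hypothetical compressed sizes in bytes, for illustration only.
example = per_part_ratios(
    {"Grammar": 30, "Symbols": 180, "Digits": 5, "Spaces and Punctuation": 35},
    original_size=1000,
)
for name, ratio in example.items():
    print(f"{name}: {ratio:.2f}")
```

Because every part is divided by the same original file size, the four ratios add up to the overall compression ratio of the file.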
As shown in Table 7, order 1 GRW3-PPM significantly outperforms order 1 GRW-PPM as it has the best compression ratio for the corpora being compressed. The improvement of GRW3-PPM over GRW-PPM occurs for all texts and ranges from over 2% to 4.2% for the BACC corpus of Arabic text.

From Tables 7 and 10, for different text files, we found that full exclusions improve the compression rate. However, this increases the execution time slightly because, with full exclusions, all symbols that have already been seen in the higher order are removed from prediction at the lower order level. (There may be many symbols needing to be excluded depending on the context.) Our test machine has an Intel Core i5 processor with 4 GB of internal memory.

It is clear from Tables 8 and 9 that not using full exclusions results in a worse compression rate. The improvement of GRW1-PPM and GRW2-PPM with full exclusions over GRW1-PPM and GRW2-PPM without full exclusions ranges on average from just over 4% to 5.4% for all texts. However, the advantage of not performing full exclusions is that compression runs on average 3% to 20% more quickly for different texts.

Table 11 shows an interesting result when comparing GRW-PPM and GRW3-PPM with PPMD and W|W. It is clear that GRW3-PPM on average significantly outperforms W|W, showing an average 7.1% improvement. The table also illustrates that there are significant differences between the compression methods for different languages. For instance, for American English text, W|W achieves the best compression rate compared with the other models, with a 3.6% improvement over GRW-PPM and a 0.45% improvement over GRW3-PPM. For British English text, W|W achieves a 4.3% improvement over GRW-PPM and a 1.0% improvement over GRW3-PPM. For Welsh, GRW3-PPM and PPMD attain a 2.3% improvement over GRW-PPM and approximately a 1.0% improvement over W|W.
For Arabic text, GRW3-PPM outperforms the other models, attaining a 14.6% improvement over PPMD and a significant 15.7% improvement over W|W. For Persian text, GRW3-PPM exceeds the other models, with a 22.3% improvement over W|W (see Fig. 3).

V. CONCLUSIONS
In this paper, a new word-based grammar scheme (GRW-PPM) has been described for compressing natural language text. Our method creates a context-free grammar by replacing words and repeated sequences of digits, spaces and punctuation with non-terminal symbols as the text is processed from beginning to end in a single pre-processing pass. The PPM text compression algorithm is then used to encode the sequences of non-terminal symbols once they have been constructed for the whole text. Unlike PPM, which is an online method, our method is off-line during the phase which generates the grammar.
In our experimental evaluation, GRW-PPM (and further variants such as GRW2-PPM and GRW3-PPM) has been compared with other well-known schemes on various language corpora for the English, Welsh, Arabic and Persian languages. The best performing scheme for the languages that use Arabic script (Arabic and Persian) is GRW3-PPM, followed by the previous best performing word-based PPM model (W|W) and then the standard character-based PPMD scheme. For the English language, our experiments show that the word-based PPM model (W|W) is the best compared with standard PPM and GRW-PPM. For Welsh text, the best results are achieved using the standard character-based PPMD scheme and GRW3-PPM. Also, GRW3-PPM significantly outperforms GRW-PPM itself for different languages.