Speeding up Natural Language Text Search using Compression

Text search is a well-known problem in computer science: finding the valid shifts of a pattern P in a text string T. This paper shows how to speed up text search by searching for P in a compressed version of T, and a fast compression algorithm was designed for this purpose. The algorithm is based on the assumption that T is restricted to the letters of a single natural language. Relying on this assumption, each letter in T or P is encoded in a single byte instead of its two-byte unicode, which halves the length of the string that a text search algorithm must scan. The main disadvantage of this approach is the restriction of the alphabet of T to a single natural language; however, a wide range of text documents complies with this assumption. Another issue is the overhead required to compress P and T, but the proposed compression algorithm turned out to be fast enough that its run time is recovered while still saving overall search time. Different approaches to storing the compressed T are also explored. The conducted experimental study shows that this approach does indeed reduce text search time.

Keywords—Text compression; text search; unicode


I. INTRODUCTION
A lot of research has been directed towards searching in compressed text. A survey of approaches to searching compressed text without decompression can be found in [1]. In [2], text search was applied to directory-based compressed text. In [3], characters are encoded as variable-length sequences of base symbols with a fixed number of bits. In [4], the input text is already compressed with Lempel-Ziv. In [5], compression and decompression techniques for natural language text are proposed; the compression scheme is based on a semi-static word-based model and Huffman encoding, where the coded alphabet is byte-oriented rather than bit-oriented. In [6], an approximate search on compressed text using local decompression is proposed. In [7], the input text is assumed to be Ziv-Lempel compressed. In [8], text search in compressed text is done using periodicity analysis, with run time sublinear in the size of the compressed text. In [9], a directory-based compression is used on natural language text.
The main observation about the previous research on this problem is that its primary motivation was to perform text search in an input string that is already compressed with known compression algorithms, without decompressing it first. In other words, compression was not originally done to speed up text search. This is the main point that distinguishes this work from others. In this paper, compression is done to speed up text search first, with the saved space as a secondary gain. The compression considered in the literature is based on known compression algorithms, which are complex and time-consuming, whereas the proposed compression is very fast and simple. A similar approach was found in [10]. However, that work differs from ours in several ways: (1) T is assumed to be in ASCII, while our work is based on unicode encoding; (2) our compression approach is much faster and simpler; (3) the shifts found in the compressed T can easily be translated to shifts in the original unencoded version of T.
II. THE PROPOSED COMPRESSION

The work proposed in this paper is based on the fact that T is encoded in unicode [11]. Unicode is an international standard for encoding the alphabets of natural languages, using two bytes per letter. The alphabets of different natural languages are encoded in contiguous ranges, so the letters of one natural language's alphabet share the same upper byte value. For example, the unicodes of the Arabic alphabet range from 0x0600 to 0x06FF, all with the same upper byte 0x06. This fact is exploited to compress T and P to half their length by excluding the upper byte. For example, the Arabic word (ھو) is composed of two letters (Fig. 1). The unicode of the letter (ھـ) is 0x0647: the upper byte is 0x06 and the lower byte is 0x47. The unicode of the second letter (و) is 0x0648, with upper byte 0x06 and lower byte 0x48. The proposed compression uses only the lower byte as the code of each letter, which compresses T to half its size. Note that this compression works only under the assumption that T contains only letters of the same alphabet. Moreover, the code of the first letter (ھـ) is placed in the upper byte of the compressed unicode. This is because Arabic script is written from right to left, and it ensures that the encoded letters are stored in the same order they have in the original text.
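As a quick illustration of this byte layout, the following standalone Java snippet (not from the paper) prints the upper and lower bytes of the two example letters:

```java
public class UpperByteDemo {
    public static void main(String[] args) {
        char heh = '\u0647'; // Arabic letter (ھـ)
        char waw = '\u0648'; // Arabic letter (و)
        // Both letters share the upper byte 0x06, the Arabic range marker.
        System.out.printf("upper: 0x%02X 0x%02X%n", heh >> 8, waw >> 8);
        // The lower bytes, 0x47 and 0x48, serve as the one-byte codes.
        System.out.printf("lower: 0x%02X 0x%02X%n", heh & 0xFF, waw & 0xFF);
    }
}
```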

A. COMPRESS Algorithm
The COMPRESS algorithm (Fig. 2) returns a compressed string Scompressed for an input unicode string S, where S may be T or P. Scompressed is half the length of S. S is a string in which each letter is encoded as a two-byte unicode; Scompressed is generated by reducing the two-byte unicode of every letter in S to a single byte in Scompressed (Fig. 1). To show how the algorithm works on the example of Fig. 1, let S be the string (ھو) with the two letters S0 = (ھـ) and S1 = (و). The algorithm left-shifts S0 by eight bits to generate S'0 (0x4700). Next, a bit-wise AND is applied between S1 and 0x00FF to generate S'1 (0x0048). Finally, a bit-wise OR between S'0 and S'1 yields a single unicode character (0x4748) that is appended to Scompressed. The loop iterates over the letters of S in pairs until Scompressed is complete. If S is of odd length, a space is appended to make its length even.

One way to store Tcompressed is in a text file using unicode. In this case, every pair of bytes of Tcompressed is interpreted as a single unicode character, and the corresponding character is written to the file, so the size of this file is half the size of the original T's file. In addition to compression, this storage acts as a simple form of encryption. For example, T = "بسم الله الرحمن الرحيم", which is composed of twenty-two characters (letters and spaces), is encoded into eleven unicode characters. Fig. 3 shows what Tcompressed looks like when its file is opened in a text editor.

Tcompressed may also be stored as a binary file (Fig. 4). In this case, the letters of Tcompressed are written to the file byte by byte, without building unicode characters from pairs of bytes, so each letter is represented by a one-byte code corresponding to the lower byte of its unicode. For example, for T = "بسم الله الرحمن الرحيم", the first byte is '(' with hex value 0x28, which is the code of the first letter (بـ); note that the unicode of the letter (بـ) is 0x0628.
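The following is a minimal Java sketch of the two storage schemes just described, written to match the Fig. 1/Fig. 2 semantics; class and method names are illustrative, not the paper's:

```java
import java.nio.file.Files;
import java.nio.file.Paths;

public class Compress {

    // Text-file storage (Fig. 3): pack each pair of letters into one char.
    // The lower byte of the first letter becomes the upper byte of the
    // packed char, preserving letter order for right-to-left scripts.
    static String compress(String s) {
        if (s.length() % 2 != 0) s += " ";        // pad odd length with a space
        StringBuilder out = new StringBuilder(s.length() / 2);
        for (int i = 0; i < s.length(); i += 2) {
            int hi = (s.charAt(i) << 8) & 0xFF00; // e.g. 0x0647 -> 0x4700
            int lo = s.charAt(i + 1) & 0x00FF;    // e.g. 0x0648 -> 0x0048
            out.append((char) (hi | lo));         // e.g. 0x4748
        }
        return out.toString();
    }

    // Binary storage (Fig. 4): one raw byte per letter, the lower byte of
    // its unicode, written without pairing bytes into unicode characters.
    static byte[] compressToBytes(String s) {
        byte[] bytes = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            bytes[i] = (byte) (s.charAt(i) & 0xFF); // e.g. 0x0628 -> 0x28 '('
        }
        return bytes;
    }

    public static void main(String[] args) throws Exception {
        String s = "\u0647\u0648"; // the two-letter example word (ھو)
        System.out.printf("packed: 0x%04X%n", (int) compress(s).charAt(0)); // 0x4748
        Files.write(Paths.get("t_compressed.bin"), compressToBytes(s));
    }
}
```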

III. SPEEDING TEXT SEARCH WITH COMPRESSION
The compressed text search algorithm is shown in Fig. 5. It takes as input P and T, compresses them using the COMPRESS algorithm, and then calls any known string matching algorithm to search for Pcompressed within Tcompressed. Although Tcompressed has half the size of T, the length of Tcompressed equals the length of T: Tcompressed is viewed as an array of bytes (one byte per letter), while T is viewed as an array of unicodes. The same holds for P and Pcompressed. The equality comparisons within the selected string matching algorithm are byte-wise comparisons rather than unicode comparisons. Since all letters share the same upper byte, equality of the lower bytes implies equality of the letters, so the valid shifts found by the reused string matching algorithm are exactly the valid shifts of P within T.

IV. EXPERIMENTAL STUDY

The compressed text search algorithm was implemented in Java, with T chosen to be the text of the Holy Quran, which is composed only of Arabic letters and has a size of 411,082 letters. The selected string matching algorithm was the well-known Knuth-Morris-Pratt (KMP) algorithm [12]. To show the reduction in text search time, the time needed to search for P in T using the compressed text search algorithm was compared with the time needed to search for P in T without compression (Fig. 6). P was randomly chosen as a substring of a given length from T, and the experiment was repeated for varying sizes of P. Note that the time to compress P and T was included in the search time of the compressed text search algorithm. To raise confidence in the results, the process was repeated 1000 times for each length of P, and the average time was recorded for both algorithms. When KMP is applied to the compressed input, it yields a significant reduction in search time. The saving occurs because an equality comparison between two unicode characters in Java is implemented through a pair of byte-wise equality comparisons, whereas when searching in Tcompressed the equality comparison is a single byte-wise comparison.
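In essence, what was timed on the compressed side is captured by the following hedged Java sketch of the Fig. 5 driver, paired with a textbook byte-wise KMP; it reuses the compressToBytes helper from the earlier sketch, and all names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

public class CompressedSearch {

    // Compress both strings to one byte per letter, then reuse a byte-wise
    // matcher. Since all letters share the same upper byte, lower-byte
    // equality implies letter equality, so the shifts carry over to T.
    static List<Integer> search(String t, String p) {
        return kmp(Compress.compressToBytes(t), Compress.compressToBytes(p));
    }

    // Textbook KMP over byte arrays; each comparison is a single byte-wise
    // equality check, which is the source of the measured speed-up.
    static List<Integer> kmp(byte[] text, byte[] pat) {
        List<Integer> shifts = new ArrayList<>();
        if (pat.length == 0 || pat.length > text.length) return shifts;
        int[] pi = new int[pat.length];           // KMP prefix function
        for (int i = 1, k = 0; i < pat.length; i++) {
            while (k > 0 && pat[k] != pat[i]) k = pi[k - 1];
            if (pat[k] == pat[i]) k++;
            pi[i] = k;
        }
        for (int i = 0, k = 0; i < text.length; i++) {
            while (k > 0 && pat[k] != text[i]) k = pi[k - 1];
            if (pat[k] == text[i]) k++;
            if (k == pat.length) {                // full match ends at i
                shifts.add(i - pat.length + 1);   // valid shift of P in T
                k = pi[k - 1];
            }
        }
        return shifts;
    }
}
```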

V. CONCLUSIONS
In this paper, natural language text was compressed to speed up text search. By excluding the upper byte of each letter's unicode, the size of both P and T was reduced to half, yielding a faster text search. This approach assumes that the letters of the text belong to the alphabet of a single natural language. One important outcome of this research is that exploiting the specifics and constraints of natural languages may open the door to improvements of string algorithms in general; although such improvements are not generic, they can be useful in certain contexts. One interesting issue to explore is how to perform the text search when Tcompressed and Pcompressed are viewed as arrays of unicodes rather than arrays of bytes. The challenge there is to compress P in such a way that the odd valid shifts of P within T are also found.