Improving Quality of Vietnamese Text Summarization Based on Sentence Compression

Sentence compression is a valuable task in the framework of text summarization. In previous works, a sentence is reduced by removing redundant words or phrases from the original sentence while trying to retain its information. In this paper, we propose a new method that uses a grid model and dynamic programming to calculate n-grams for generating the best sentence compression. The reduced sentences are then combined into a text summary. The experimental results show that our method is effective and that the summary text is grammatical, coherent, and concise.

Keywords—Sentence compression; topic modeling; text summarization; grid model; n-grams; dynamic programming


I. INTRODUCTION
Text summarization is a technique that allows computers to automatically generate summaries from one or more source documents. Based on the features of the main content, it recapitulates the content of the original documents. Text summarization has attracted researchers since the 1960s and is still an active topic in forums and seminars around the world [1].
Traditional text summarization methods are usually based on a sentence extraction approach [1], [9]. The summary is made up of sentences selected from the original document. As a result, the meaning and content of such summaries are often fragmentary, and the summary lacks coherence and concision. Other text summarization methods treat the task as a natural language processing problem, producing summaries with a good linguistic score that coherently follow the content of the original. One of them is the sentence compression technique [2], [3], [7]. With the compression approach, researchers have focused on supervised learning techniques, lexical rules, or deep language analysis based on syntax trees [10]. These methods have the following characteristics:  High cost of building the training corpus.
 Meticulous, time-consuming construction by language experts, especially for corpora of lexical rules.
 High computational complexity.
Therefore, in this paper, we use a sentence compression method based on a grid model to create text summaries, with the following targets:  Use unsupervised learning techniques to reduce the cost and time of manually building a corpus.
 Minimize computational complexity by using a dynamic programming algorithm.
The rest of the paper is organized as follows: Section 2 introduces related work. Section 3 presents our method for Vietnamese feature reduction, and the methodology of Vietnamese sentence compression is presented in Section 4. Experiments and results are shown in Section 5. Finally, Section 6 concludes and discusses future work.

II. RELATED WORKS
The sentence compression task is defined as the removal of redundant components from a sentence to produce a shorter sentence. Text summarization based on a sentence compression approach connects multiple reduced sentences into a shorter document whose meaning and grammar are acceptable, guaranteeing coherence of content and meaning. Several studies of sentence compression have shown the importance of this approach to the text summarization problem. The first to propose a sentence compression model were Jing and McKeown in 2000, who presented one of the earliest approaches to sentence compression using machine learning and classifier-based techniques. Their work focused on removing inessential phrases from extractive summaries based on an analysis of human-written abstracts. In their experiments, they used human-written abstracts, and the corpus was collected from the free daily news and headlines provided by the Benton Foundation [4].
The noisy channel is typical of sentence compression methods. In the studies of Marcu et al., two methods for sentence compression were suggested: one is the noisy channel model, where the probabilities for sentence compression P(compress | S) are estimated from a training set of manually crafted (Sentence, Sentence_compressed) pairs while considering lexical and syntactical features. The other approach learns syntactic tree rewriting rules, defined through four operators: SHIFT, REDUCE, DROP, and ASSIGN [9].
In the work of Le Nguyen and Ho in 2004, two sentence compression algorithms were also proposed. The first is based on template translation learning, a method inherited from machine translation, which learns lexical transformation rules by observing a set of 1500 (Sentence, Sentence_reduced) pairs, selected from a website and manually tuned to obtain the training data. Owing to the complexity of applying this big lexical rule set, they proposed an improvement in which a stochastic Hidden Markov Model is trained to help decide which sequence of possible lexical reduction rules should be applied in a specific case [11]. Some other works used an unsupervised approach. In the work of Turner and Charniak, a training corpus is automatically extracted from the Penn Treebank to fit a noisy channel model [3] similar to the one used by Knight and Marcu [8]. Clarke and Lapata devised a different and quite curious approach, in which the sentence compression task is defined as the optimization goal of an Integer Programming problem. Several constraints are defined according to language models and linguistic and syntactical features. Although this is an unsupervised approach that uses no parallel corpus, it is completely knowledge driven, with a set of crafted rules and heuristics incorporated into the system [2].
All these works were applied to English. For Vietnamese, there are some methods for sentence compression. Minh Le Nguyen et al. proposed two methods: one applied an HMM to Vietnamese sentence compression, and the other used syntax control for reducing sentences [10], [11]. Ha Nguyen Thi Thu et al. used unsupervised and supervised learning to create Vietnamese text summaries based on sentence compression [5], [6].

III. VIETNAMESE TEXT FEATURE REDUCTION

A. Feature reduction problem
Consider applications such as data processing systems (voice signals, images, or pattern recognition in general) in which the feature set is a set of real-valued vectors. Such a system is often effective only if the dimension of each individual vector is not too large.
The problem of dimensionality reduction occurs when data have a dimension greater than the processing capabilities of the system [16]. For example, a face recognition/classification system based on gray-level images of size m×n works with m×n-dimensional real-valued vectors. In an experiment, an image may have m = n = 256, i.e., 65536 dimensions. This makes it difficult to build a multilayer perceptron (MLP) for the classification system. Therefore, feature reduction is an important problem when working with data that has many features, such as images, voice, and text.

Example 1: table, human, computer, … are topic words.
For Vietnamese text, text processing often uses a word segmentation tool to separate the words in a text. In previous works, we proposed a feature reduction method published in [5], [6]. The computational complexity of a large feature set can be reduced by using a word segmentation tool to separate words into two sets: a noun set (called topic words) and a set of the other words. In any text, the nouns carry the information of the text, so extracting the nouns from the text yields a remarkable reduction of the large feature set.
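The split into topic words (nouns) and other words can be sketched as follows. A real system would rely on a Vietnamese word segmentation and POS tagging tool; here a toy part-of-speech lookup (an assumption, not the paper's tool) stands in for it:

```python
# Sketch of the feature-reduction step: split a segmented sentence into
# a topic-word set (nouns) and a set of the remaining words.
# TOY_POS is an invented stand-in for a real Vietnamese POS tagger.

TOY_POS = {
    "researchers": "N", "university": "N", "system": "N", "data": "N",
    "created": "V", "first": "A", "can": "M", "contain": "V",
}

def split_topic_words(tokens):
    """Return (topic_words, other_words), preserving word order."""
    topic, other = [], []
    for tok in tokens:
        (topic if TOY_POS.get(tok.lower()) == "N" else other).append(tok)
    return topic, other

tokens = ["Researchers", "created", "first", "system", "can", "contain", "data"]
topic, other = split_topic_words(tokens)
print(topic)  # the nouns kept as features
print(other)  # the words that may be dropped
```

Only the noun set is kept as the feature set, which is what shrinks the term matrix in the examples below.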

Example 2:
Consider an original Vietnamese text that includes 34 words (shown here in English translation):

"Researchers at the University of Michigan have created a first prototype system for small-scale computing, which can contain data for a week while integrating them into very small parts as the human eyes."
For this document, we must calculate a weight for each of the 34 words and represent the text as a matrix T with 1 row and 34 columns (1). In Example 2, we separate the document d into two sets: the first set includes the nouns, and the second set contains the remaining words.
Using this text separation technique with the two sets, the size of the matrix T is reduced: for the original text in Example 2, instead of the matrix T with one row and 34 columns, we only need a matrix T' with one row and 14 columns (2).

IV. USING GRID MODEL FOR VIETNAMESE TEXT SUMMARIZATION
Our Vietnamese sentence compression method is based on unsupervised learning with a grid model combined with dynamic programming to choose the best shortened sentence. The calculation is based on the set of nouns in a sentence, which limits the loss of information from the sentence.
To calculate the probability of a word sequence P(w_1, w_2, ..., w_n), we use the chain rule of probability:

P(w_1, w_2, ..., w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2) ... P(w_n | w_1 ... w_{n-1})

Suppose there are 11 words in the sentence W; then W is represented by W = {w_1, w_2, ..., w_11}.

V. EXPERIMENTS AND RESULTS

Table 1 shows compression ratios in the second column; the lower the compression ratio, the shorter the reduced sentence. The grammaticality score in the third column indicates the appropriateness of the reduced sentence in terms of grammar.
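The chain-rule factorization can be checked with a small numeric sketch; the conditional probabilities below are invented for illustration, not taken from any trained model:

```python
# Chain rule for a 3-word sequence:
# P(w1, w2, w3) = P(w1) * P(w2 | w1) * P(w3 | w1, w2)
# The factor values are toy numbers chosen for illustration.

from functools import reduce

factors = [0.2, 0.5, 0.4]  # P(w1), P(w2|w1), P(w3|w1,w2)
joint = reduce(lambda a, b: a * b, factors, 1.0)
print(joint)  # product of the three factors, i.e. about 0.04
```

Each additional word multiplies in one more conditional factor, which is why long contexts are later approximated with n-grams.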

VI. CONCLUSION
Many studies on Vietnamese text summarization have been published, showing the importance of Vietnamese information processing today. In this paper, we presented a text summarization method based on a sentence compression approach whose targets are to reduce time by applying an unsupervised learning method, to avoid the cost of manually building a corpus, and to reduce computational complexity by using a dynamic programming algorithm. The experimental results illustrate that our approach satisfies the requirements of text summarization and can be applied to a number of different languages.

Fig. 3. Dimensional vector reduction model

B. Methodology of Vietnamese text feature reduction

Definition 1 (topic word): A topic word is a noun that has been extracted from a sentence.

Applying the chain rule to the words:

P(w_1, w_2, ..., w_n) = ∏_{i=1}^{n} P(w_i | w_1 ... w_{i-1})

With n-grams, the conditional probability of the next word in the sequence is approximated as

P(w_i | w_1 ... w_{i-1}) ≈ P(w_i | w_{i-n+1} ... w_{i-1})

In this paper, we use bi-grams to reduce the complexity of the calculation; therefore, when using the bi-gram model to predict the conditional probability of the next word, we can use the approximate formula

P(w_i | w_1 ... w_{i-1}) ≈ P(w_i | w_{i-1})

Definition 2 (word substring): A word substring is initialized by a topic word and stops with a topic word, with no topic word between them.

Example 3: In his memory, flowers bloom along the river.

Definition 3 (most likelihood word substring): The most likelihood word substring is the word substring in which every word has maximum bi-gram probability.
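The bi-gram probabilities can be estimated by maximum likelihood from counts, P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}). A minimal sketch over an invented three-sentence corpus (the real system would train on a Vietnamese corpus):

```python
# Maximum-likelihood bi-gram estimation from a toy corpus.
# The corpus sentences are invented for illustration.

from collections import Counter

corpus = [
    ["flowers", "bloom", "along", "the", "river"],
    ["flowers", "bloom", "in", "the", "garden"],
    ["birds", "sing", "along", "the", "river"],
]

# Count each word as a left context (exclude sentence-final words),
# and count all adjacent word pairs.
contexts = Counter(w for sent in corpus for w in sent[:-1])
bigrams = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

def p_bigram(prev, word):
    """MLE estimate of P(word | prev); 0.0 for unseen contexts."""
    return bigrams[(prev, word)] / contexts[prev] if contexts[prev] else 0.0

print(p_bigram("flowers", "bloom"))  # 2/2 = 1.0
print(p_bigram("the", "river"))      # 2/3, since "the" also precedes "garden"
```

These estimates supply the edge scores that the grid model maximizes when choosing a path through each word substring.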
Using word identification and word segmentation tools, the sentence is separated into two sets of words, where the noun set includes the words w_4, w_6, w_8, w_9. The grid model for this sentence is created as in Figure 4 below. Figure 4 illustrates an original sentence with 11 words, in which w_4, w_6, w_8, w_9 are the topic words (nouns), so we have 3 word substrings. The reduced sentence can be generated by these steps:  Step 1: Start with the first word substring w_1..w_4. Initializing with w_1, there are 3 ways from w_1: w_1→w_2, w_1→w_3, w_1→w_4.  Step 2: Calculate the probability of each of these ways from w_1: S_12 = w_1→w_2, S_13 = w_1→w_3, S_14 = w_1→w_4. Then choose the one with the highest likelihood.
 Step 3: Track the point with the highest likelihood. If it is not the end of the word substring, loop Step 2: continue with a new way from the tracked point to another word in the word substring.  Step 4: Continue with the next word substring.  Step 5: The reduced sentence is the projection of the chosen words onto the horizontal line.
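The steps above can be sketched as a dynamic program over one word substring: find the highest-probability path from the substring's first word to its closing topic word, where a step from w_i to any later w_j is scored by a bi-gram probability. The scores below are invented toys; in the paper's method a trained bi-gram model supplies them:

```python
# Dynamic-programming sketch of the grid-model compression step:
# max-product path from the first word of a substring to its last word.
# Bi-gram scores are invented for illustration.

def best_path(words, score):
    """Return the max-probability path from words[0] to words[-1]."""
    n = len(words)
    best = [0.0] * n  # best[i]: best probability of reaching word i
    back = [0] * n    # back-pointers for path recovery
    best[0] = 1.0
    for j in range(1, n):
        for i in range(j):
            cand = best[i] * score(words[i], words[j])
            if cand > best[j]:
                best[j], back[j] = cand, i
    path, j = [n - 1], n - 1
    while j != 0:
        j = back[j]
        path.append(j)
    return [words[i] for i in reversed(path)]

# Toy bi-gram scores for the substring w1..w4 of the example sentence.
TOY = {("w1", "w2"): 0.1, ("w1", "w3"): 0.6, ("w1", "w4"): 0.05,
       ("w2", "w3"): 0.2, ("w2", "w4"): 0.4, ("w3", "w4"): 0.3}

compressed = best_path(["w1", "w2", "w3", "w4"],
                       lambda a, b: TOY.get((a, b), 0.0))
print(compressed)  # the chosen path, e.g. dropping w2 here
```

With these toy scores the path w1→w3→w4 (probability 0.6 × 0.3 = 0.18) beats both the direct jump w1→w4 (0.05) and the full path (0.006), so w2 is dropped; repeating this over every word substring and projecting the chosen words yields the reduced sentence.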