Conditional Text Paraphrasing: A Survey and Taxonomy

This work presents a survey of the text paraphrasing task. The survey covers the different tasks defined around text paraphrasing, the techniques and models regularly used to approach them, and the datasets used to train and evaluate those models. Because text paraphrasing is useful as a component of other applications, the paper also reviews some of those applications. In addition, this work proposes a new taxonomy called Conditional Text Paraphrasing. To the best of our knowledge, this is the first work that lays out the varieties and sub-problems of the original text paraphrasing task. The aim of this taxonomy is to expand the definition of text paraphrasing by adding conditional constraints as features that control either paraphrase generation or discrimination. This expanded definition opens a new research direction in Natural Language Processing (NLP) and Machine Learning. Finally, some useful applications of conditional text paraphrasing are presented.

Keywords: Natural Language Processing; Text Paraphrasing; Conditional Text Paraphrasing


A. Problem Definition
Text paraphrasing is a core and challenging problem in Natural Language Processing. The problem concerns texts that convey the same meaning using different expressions. It can be viewed as a transformation of a given text that preserves its semantic meaning. These transformations may operate at the level of the text's linguistic features and structure. Paraphrasing differs from entailment in the type of relationship between instances. Entailment occurs when one may draw necessary conclusions from an input instance, for example: "Rocky is a dog. Rocky is an animal." Entailment has the form "If A then B", while paraphrasing has the form "A is B". Entailment is a one-way relationship: "If A then B" does not imply "If B then A". For instance, "Rocky is a dog" means "Rocky is an animal", while "Rocky is an animal" does not necessarily mean "Rocky is a dog", since Rocky may be another type of animal. Unlike entailment, paraphrasing is a two-way relationship. For instance, "What is the distance between Earth and Sun?" has the same meaning as "How many miles between Earth and Sun?"
This work presents the tasks defined around the text paraphrasing problem (section 1.2) and formulates the discrimination and generation tasks, alongside some existing research efforts in both directions (sections 2.1 and 2.2). After that, it presents the evaluation metrics regularly used to evaluate the models (section 3) and the datasets used (section 4). We then discuss several applications in which text paraphrasing is used, either for data augmentation or as a module in larger systems (section 5). Finally, a definition of conditional text paraphrasing and its taxonomy are proposed (section 6), along with some important applications (section 7).

Fig. 1. Text Paraphrasing Tasks

B. Defined Tasks
As shown in Figure 1, text paraphrasing is a type of problem on which natural language processing and machine learning can cooperate. Text paraphrasing involves two different tasks: discrimination and generation. The goal of discrimination is to check whether two given texts are paraphrases of each other; in that case, the task is a discriminative problem. In generation, the goal is to generate one or more texts given a reference text; in that case, the task is a generative problem.

Some research treats this problem as a Semantic Text Similarity problem, in which distance metrics such as Euclidean and cosine distance are used [1], [2], with sentences represented either as binary vectors extracted from lexical features or as TF-IDF vectors [3]. However, the great success of distributed word and sentence representations [4], [5], [6] changed the basic text representations used in distance measures [7].
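The similarity-based view can be illustrated with a minimal sketch that builds TF-IDF vectors for two sentences and compares them with cosine similarity. The tokenizer, the smoothed IDF formula, the function names and the decision threshold of 0.5 are all assumptions made for this illustration and are not taken from the cited works.

```python
import math
from collections import Counter

PUNCTUATION = ".,?!;:\"'"


def tokenize(text):
    """Lowercase, split on whitespace and strip basic punctuation."""
    tokens = (tok.strip(PUNCTUATION) for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def tfidf_vectors(sentences):
    """Return one sparse TF-IDF vector (a dict) per sentence."""
    docs = [Counter(tokenize(s)) for s in sentences]
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        # Smoothed IDF (an assumption) keeps terms shared by all sentences non-zero.
        vectors.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in doc.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_paraphrase(s1, s2, threshold=0.5):
    """Label a pair as a paraphrase when similarity reaches a hypothetical threshold."""
    u, v = tfidf_vectors([s1, s2])
    return cosine_similarity(u, v) >= threshold
```

Because such a score depends only on word overlap, two paraphrases with little shared vocabulary can score low, which is one reason distributed representations displaced these basic features.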
Other research treats the problem as a supervised learning problem, often formalized as binary classification with labels $y \in \{0, 1\}$. For example, in [8] a Support Vector Machine (SVM) is used with basic feature representations of sentences. The great success of deep neural networks in fields such as natural language processing, computer vision and speech recognition, in both supervised and unsupervised learning, motivated building neural models for this classification problem [9]. Recently, Convolutional Neural Networks (CNNs) showed remarkable results in text modeling for classification and feature extraction [10], [11], and they have been used for this task, as in [12]. Other works focus on recurrent models, as in [13], where Long Short-Term Memory (LSTM) networks encode the sentences within a Siamese network structure [14]. Figure 2 shows the general pipeline for the discrimination task.
In conclusion, deep neural networks, together with recent advances in word and sentence representations, are currently heavily used for the paraphrase identification (discrimination) task.

B. Generation Task
In the generation task, given a reference sentence $S_1 = \{w_1, w_2, \ldots, w_n\}$, the goal is to generate one or more candidate sentences that are semantically equivalent to the reference sentence. It is considered a text generation problem.
Classically, lexical features and word replacement were used to generate alternatives to the reference sentence. For instance, paraphrases are generated using templates extracted from WikiAnswers repositories, as in [15], and lexical rules, as in [16]. These approaches make use of WordNet to obtain hypernyms and synonyms of words for replacement; however, they tend to generate poor candidate paraphrases.
Recently, the great success of deep generative models [17] such as Variational Autoencoders (VAEs) [18] and Generative Adversarial Networks (GANs) [19] has had a major impact on unsupervised learning problems. Several works have addressed generating realistic text for task-specific problems such as machine translation [20], [21] and question generation [22]. Generic text generation has been investigated using VAEs [23] and GANs [24]. As these models depend on the hidden representation of sentences during training, the produced texts are randomized and uncontrollable. Serious attempts were made recently to control the generated sentences using conditional features such as the sentence's polarity and syntax tree [25], [26]. Typically, this problem is treated as a sequence-to-sequence problem [27], in which the goal is to generate a sequence of words given another sequence of words [28], [29], [30]. Figure 3 shows the general pipeline for the generation task, highlighting the most widely used techniques.
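A minimal sketch of replacement-based generation is given below. It uses a small hand-written synonym table (`SYNONYMS`, a hypothetical stand-in for a lexical resource such as WordNet) and substitutes words without looking at context; the function name and the `limit` parameter are likewise assumptions for illustration.

```python
from itertools import product

# Hypothetical synonym table standing in for a lexical resource such as WordNet.
SYNONYMS = {
    "distance": ["gap"],
    "big": ["large", "huge"],
    "quick": ["fast"],
}


def lexical_paraphrases(sentence, synonyms=SYNONYMS, limit=10):
    """Generate candidates by swapping words for listed synonyms, ignoring context."""
    tokens = sentence.split()
    # Each position offers the original word followed by its synonyms, if any.
    options = [[tok] + synonyms.get(tok.lower(), []) for tok in tokens]
    candidates = []
    for combo in product(*options):
        candidate = " ".join(combo)
        if candidate != sentence:
            candidates.append(candidate)
        if len(candidates) >= limit:
            break
    return candidates
```

Since every substitution is made independently of the surrounding words, a sketch like this readily produces unnatural candidates, which is consistent with the weakness noted above.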

III. EVALUATION MEASURES
Evaluation metrics are applied to both the discrimination and generation tasks. As the discrimination task is a supervised learning problem, metrics such as accuracy, precision, recall and F1-score are used to evaluate the trained models. The generation task is evaluated very differently: BLEU (Bilingual Evaluation Understudy) [31], ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [32], METEOR (Metric for Evaluation of Translation with Explicit ORdering) [33] and Translation Error Rate (TER) [34] are used for nearly all natural language generation tasks.
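The two families of metrics can be sketched as follows. The first function computes precision, recall and F1 for binary paraphrase labels. The second computes clipped (modified) n-gram precision, which is only the core ingredient of BLEU: the full metric also combines several n-gram orders and applies a brevity penalty, both omitted here. Function names and the whitespace tokenization are assumptions for illustration.

```python
from collections import Counter


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = paraphrase)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_ngram_precision(candidate, reference, n=1):
    """Clipped n-gram precision of a candidate against a single reference."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Clipping stops a candidate from being rewarded for repeating a reference word.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total
```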

IV. DATASETS
Compared to other tasks, the datasets for the text paraphrasing task are not large. This underlines the importance of robust, general text paraphrasing models, which can help create large and diverse datasets both for paraphrasing itself and for other problems such as Sentiment Analysis and Named Entity Recognition.

A. MSCOCO
Microsoft released a dataset of image captions [35]. The dataset comes with 120K images captioned with short and medium-length texts. Each image has five captions describing it, written by five different annotators. As the annotators describe the same image, the captions can be treated as paraphrases of each other.
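As a rough sketch of how captions can be turned into paraphrase pairs, the function below pairs every two captions of the same image, so five captions yield ten unordered pairs. The input layout (a mapping from image identifier to a list of captions) and the function name are assumptions for illustration and do not reflect the dataset's actual file format.

```python
from itertools import combinations


def caption_pairs(captions_by_image):
    """Pair up captions of the same image as candidate paraphrases."""
    pairs = []
    for image_id, captions in captions_by_image.items():
        # combinations() yields each unordered pair of captions exactly once.
        for first, second in combinations(captions, 2):
            pairs.append((image_id, first, second))
    return pairs
```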

B. PPDB
PPDB [36], [37] is a widely known dataset for paraphrase generation. It comes in several sizes; the most used is the PPDB 2.0 Large dataset. For some phrases, PPDB provides one-to-many paraphrases.

C. Quora Questions Pairs
Quora Question Pairs [38] is a dataset produced by Quora. It contains approximately 400K pairs of sentences. The sentences are questions, and each pair is labeled by whether it is a duplicate, i.e., has the same semantic meaning, or not. Duplicate questions are considered to be paraphrases.

D. SNLI
The SNLI dataset [39] consists of approximately 570k sentence pairs. These sentences are written by human annotators and manually labeled as entailment, contradiction or neutral. For the paraphrasing task, the focus is on the neutral sentences, as they are taken to be paraphrases of each other. SNLI is heavily used in natural language generation tasks and to capture the semantics of language [40].

E. WikiAnswers
WikiAnswers [41] is a large question paraphrase corpus created by crawling the WikiAnswers website. The paraphrases are different questions that users tagged as similar. The dataset contains approximately 18M word-aligned question pairs.

V. APPLICATIONS
Paraphrasing has numerous applications. It is used either as a preprocessing module that enlarges datasets, i.e., as a data augmentation technique, or embedded in an end-to-end model. For instance, paraphrasing is used in Question Answering (QA) systems [42], where several paraphrases of the given question are generated in an end-to-end neural model to increase the diversity and coverage of the input.
Paraphrasing is also used to evaluate Machine Translation models more accurately [43]. This is done by generating paraphrases that are closer in wording to the translation output, based on lexico-semantic resources such as WordNet. Some research uses paraphrasing to improve the generated logical form of sentences [44], a problem known as Semantic Parsing, and as a representation of input queries [45] in Semantic Search.
Another important application of paraphrasing is plagiarism detection. In this task, the goal is to check whether one text was copied or altered from another. It is a typical paraphrasing application and can mainly be used to protect authors' copyright ownership. Natural language processing also suffers from a lack of resources and datasets that can be used to train models [26], which reduces model generalization. Several works use paraphrasing as a data augmentation technique to enrich datasets [46], [47], [48]. Recently, paraphrasing has been used to reformulate questions, leading to better question generation and more diverse question intents [49].

VI. PROPOSAL: CONDITIONAL TEXT PARAPHRASING TASK
This work proposes a Conditional Text Paraphrasing task as an extension of the original task. As paraphrasing focuses only on the semantics of sentences, more paraphrasing specifications are needed to control the recognition and generation processes. In conditional paraphrasing, the goal is to either detect or generate paraphrases according to a specific condition or constraint. The conditions are divided into Morphology-based, Syntax-based and Readability-based categories. This work is motivated by several attempts to control text generation [25], [26]. Its aim is to set out and organize the problem as a task closely related to the original one. It presents ways to control paraphrase generation by creating a taxonomy that defines the task. This may help other researchers working on it, and the taxonomy can be expanded with more concrete tasks if needed.
Generating feasible sentences is a challenging problem in natural language processing because of several obstacles that arise when modeling text. For instance, representing text with a hidden representation that captures both its semantics and its structure is hard because of the complexity of text and language. Controlling text generation means that the latent representation of a sentence must capture the semantics, the structure and the disentangled representation of the embedded attribute. For images, some attempts have been made to control the generated image with disentangled attributes [50], [51]. In general, as the nature of text is a discrete data; training the models that depend  [25] tackled this problem by allocating one dimension of the latent representation to encode the disentangled attributes, such as polarity in the sentiment analysis task, and generating samples with the desired sentence semantics. That research focused on disentangled representations of the polarity (positive, negative) and tense (past, present, future) attributes; however, we believe this could be generalized further by expanding the set of attributes on which the model can be conditioned, forcing it to generate more specific and concrete samples. These conditions are divided into Morphology-based, Syntax-based and Readability-based categories. The conditioned features control sentence generation and also discrimination. In the discrimination task, the conditioned features can be viewed as classes into which any paraphrase can be classified. The proposed taxonomy tree for conditional text paraphrasing is shown in Figure 4.
Conditional Text Paraphrasing is divided into three sub-classes. These sub-classes control the paraphrasing of the output sentences; however, they all remain paraphrases with respect to semantic meaning. In other words, they are all paraphrases that convey the same meaning but differ in their morphology, syntax and readability. For instance, consider the sentence "Elon Musk is the founder of SpaceX". The sentences "Elon Musk created SpaceX" and "Elon Musk is going to create SpaceX" are candidate paraphrases, but they are in the past and future tenses, respectively. Also, the sentences The founder of SpaceX As humans, it is easy for us to identify whether a text is complex or written by a non-native speaker, so the goal is to be able to either generate or detect texts that have different styles.
Formally, for the generation task in conditional paraphrasing, we are given a reference sentence $S_1 = \{w_1, w_2, \ldots, w_n\}$ and, in addition, a conditioning class as a feature $C$. The goal is to generate one or more candidate sentences that are semantically equivalent to the reference sentence and also satisfy the conditioning property $C$.
On the other hand, for the discrimination task, given two sentences $(S_1, S_2)$, where $S_1 = \{w_1, w_2, \ldots, w_n\}$ and $S_2 = \{w_1, w_2, \ldots, w_m\}$, the goal is to classify whether they are paraphrases or not; furthermore, it is possible to determine which type of paraphrase they are conditioned on. This transforms the original classification problem into a multi-class classification problem. As shown in Figure 5, which is derived from the above taxonomy, there are two parent classes, four conditioning classes and seven conditioning sub-classes. That makes the total number
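One simple way to supply the conditioning class C to a generation model is to encode it as a one-hot vector and append it to the sentence representation. The sketch below assumes this encoding; the condition labels (taken from the tense attribute discussed earlier), the function names and the use of plain Python lists are hypothetical choices for illustration, not the design of any cited model.

```python
# Hypothetical condition labels; the tense classes follow the examples in the text.
CONDITIONS = ("past_tense", "present_tense", "future_tense")


def one_hot(condition, conditions=CONDITIONS):
    """Encode the conditioning class C as a one-hot list."""
    if condition not in conditions:
        raise ValueError(f"unknown condition: {condition!r}")
    return [1.0 if name == condition else 0.0 for name in conditions]


def conditioned_input(latent, condition, conditions=CONDITIONS):
    """Concatenate a sentence representation with the encoded condition C."""
    return list(latent) + one_hot(condition, conditions)
```

A decoder that receives such a vector sees both the meaning to preserve and the property the output should satisfy; how well it honors the condition is left entirely to the model and its training.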

VII. CONDITIONAL TEXT PARAPHRASING APPLICATIONS
Conditional paraphrasing has several applications. Mainly, it could be used in question-answer generation and domain-specific data augmentation. The constraints could be encoded and fed to the generation model as a conditional feature to control the generated texts, as in [25], [26]. For the discrimination model, the problem becomes a multi-class classification problem. Given that, transfer learning [52] and multi-task learning [53] approaches could be applied to domain-specific objectives.
Conditional text paraphrasing could be very helpful in learning distributed representations of words and sentences. Currently, supervised learning models dominate the field of sentence and word representation [40], so the provided constraints, specifically the syntax-based features, could lead to much better text representations, which would in turn benefit natural language understanding systems.

VIII. CONCLUSION
This work presented a survey of text paraphrasing and of recent research efforts in the directions of paraphrase generation and discrimination. It also proposed a definition and taxonomy of the conditional text paraphrasing task and discussed its potential impact on existing applications and problems. As future work, we plan to build datasets for conditional text paraphrasing and to conduct experiments on this task.

Fig. 4. Conditional Text Paraphrasing Taxonomy. Reference Text: Elon Musk is the founder of SpaceX