Implementation of Pattern Matching Algorithm for Portable Document Format

E-documents are widely and freely available on the Internet. This availability creates the potential for plagiarism of scientific e-documents. Plagiarism detection is in some cases still done manually, which makes it prone to mistakes in observing and remembering the checks that have already been made. The method used in this research represents the two sets of objects being compared in the form of probability. To implement the method, the Rabin-Karp algorithm is applied; Rabin-Karp is a string matching algorithm that uses hash functions to compare the searched string (m) with substrings of the text (n). If both hash values are the same, the comparison is performed once more on the characters themselves. The resulting system is a web-based application that shows the similarity value of two sets of objects.
Keywords: Pattern matching; Rabin-Karp algorithm; data mining; web


I. INTRODUCTION
Plagiarism affects developing countries such as Indonesia, and recent cases have also been found in developed countries such as the United States. The difference is that developed countries impose serious sanctions on plagiarism, while Indonesia still seems reluctant to impose tough sanctions. Because most scientific work has not been protected by Hak atas Kekayaan Intelektual (HaKI, intellectual property rights), plagiarism is classified as an academic crime that constitutes an ethical violation and is difficult to criminalize. A first step in preventing such cases is a way to detect possible plagiarism in the college environment, primarily in the final projects of undergraduate candidates, master's theses and doctoral dissertations, which are prone to plagiarism [1].
There are two main classes of methods used to reduce plagiarism: methods of preventing plagiarism and methods of detecting plagiarism. Prevention methods include punishment procedures and complementary procedures that explain what plagiarism is. These methods have a long-term positive effect, but they take a long time to implement because they rely on social cooperation between different universities and departments [6]. Detection methods include manual methods and software. They are easy to implement but have only a momentary positive effect. Both classes of methods can be combined to reduce cheating. Although software is the most efficient approach to identifying plagiarism, the final assessment must be done manually [7].
To minimize the practice of plagiarism, written work must be checked. It is not enough simply to remind students that plagiarism is wrong. Detecting plagiarism is the best way to minimize such fraudulent actions. However, manual detection is difficult because of the large amount of writing, so a system is needed to detect plagiarism. Methods for detecting plagiarism can be classified into three groups: the full-text comparison method, the document fingerprinting method and the keyword equality method [1].
The Rabin-Karp algorithm is a string-matching algorithm that uses hash functions to compare the search string (m) with substrings of a text (n). It is based on the fact that if two strings are equal, their hash values must be the same. Two problems arise from this. First, there are so many different strings that each one cannot have its own hash value; this is handled by assigning the same hash value to multiple strings. Second, strings that have the same hash value do not necessarily match; to overcome this, every string with a matching hash value is verified by brute-force string matching [1], [3].

II. RESEARCH METHOD
Similarity measurement has been developed with a variety of methods. Although each method measures similarity in its own way, the goal remains the same: to create a system that can measure the level of similarity between text strings optimally and effectively [1].
Three kinds of techniques have been developed to determine the similarity of documents [1], [2]: distance-based, feature-based and probabilistic-based similarity measures. Comparison processes fall into four categories according to the order in which the pattern is compared with the text [3], and the Rabin-Karp algorithm belongs to the left-to-right category. It uses a hash function, which provides a simple way to avoid Θ(m²) time complexity.
We use the Rabin-Karp algorithm to compare the pattern of an uploaded file with the files stored on the server. This comparison yields the percentage similarity between the uploaded file and the files on the server. Before the comparison, the documents go through the preprocessing steps shown in Fig. 1: case folding, tokenizing, filtering and stemming.

A. Case Folding
In this process, we convert all words in the document to lowercase (a to z) [4].

B. Tokenizing
We split the input string at the specified delimiters. Characters other than letters are treated as delimiters and are removed in the process of obtaining the words that make up the text. This process yields the constituent words of the text, often called tokens or terms [4].
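As an illustration, case folding and tokenizing can be sketched in Python as follows. The function names `case_fold` and `tokenize` are hypothetical, and the sketch assumes plain a to z text; it is not the authors' implementation.

```python
import re


def case_fold(text: str) -> str:
    """Case folding: convert every character to lowercase."""
    return text.lower()


def tokenize(text: str) -> list[str]:
    """Split on any run of characters other than the letters a to z.

    Non-letter characters act as delimiters and are discarded.
    """
    return [token for token in re.split(r"[^a-z]+", text) if token]


print(tokenize(case_fold("Rabin-Karp, 2 PDF files!")))
```

Case folding is applied first so that the delimiter pattern only has to recognize lowercase letters.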

C. Filtering
We remove the words that are registered in the stop-word list (stop-list). Stop-words are words that appear frequently in the text and are considered to carry no significance [3].
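A minimal sketch of the filtering step is given below. The stop-list shown is a small hypothetical English list chosen only for illustration; the actual list used by the system is not given in this paper.

```python
# Hypothetical stop-list for illustration only; the system's real list is not published here.
STOP_WORDS = frozenset({"the", "of", "and", "in", "is", "to"})


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not registered in the stop-list."""
    return [token for token in tokens if token not in stop_words]


print(remove_stop_words(["the", "detection", "of", "plagiarism"]))
```

A set is used for the stop-list so that each membership check takes constant time on average.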

D. Stemming
We perform this process to obtain the root word of each word. Nazief-Adriani stemming is a stemming algorithm created by Bobby Nazief and Mirna Adriani [8].

E. Rabin-Karp
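The algorithm rests on the fact that if two strings are the same, their hash values must be the same. Two problems arise from this. The first is that there are so many different strings; this problem is solved by assigning the same hash value to multiple strings [5]. The second is that strings with the same hash value do not necessarily match, so candidates with equal hash values are compared once more character by character.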
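The idea can be sketched in Python as follows. The sketch assumes a standard polynomial rolling hash with base 256 and modulus 101; the base, the function name `rabin_karp` and the exact hash form are illustrative assumptions rather than the authors' implementation. Positions whose hash equals the pattern hash are verified character by character, which handles different strings that share a hash value.

```python
def rabin_karp(pattern: str, text: str, base: int = 256, modulus: int = 101) -> list[int]:
    """Return every index at which pattern occurs in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []

    # Weight of the leading character of a window, used when sliding the window.
    high = pow(base, m - 1, modulus)

    # Preprocessing: hash the pattern and the first window of the text.
    pattern_hash = window_hash = 0
    for i in range(m):
        pattern_hash = (pattern_hash * base + ord(pattern[i])) % modulus
        window_hash = (window_hash * base + ord(text[i])) % modulus

    matches = []
    for i in range(n - m + 1):
        # Equal hashes only signal a candidate; confirm by comparing characters.
        if pattern_hash == window_hash and text[i:i + m] == pattern:
            matches.append(i)
        if i < n - m:
            # Slide the window: drop text[i], append text[i + m].
            window_hash = ((window_hash - ord(text[i]) * high) * base
                           + ord(text[i + m])) % modulus
    return matches


print(rabin_karp("abra", "abracadabra"))
```

A small modulus such as 101 makes shared hash values frequent, so the character-level check is essential for correctness; a larger modulus would reduce the number of false candidates.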

F. Similarity Value Measurement
Measuring the similarity and distance between two information entities is a key requirement for information discovery. The first stage divides the words into k-grams. The second stage groups the terms that result from the same k-grams. The similarity of the word sets is then calculated with formula (1), Dice's Similarity Coefficient, applied to the word pairs [9].
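The following Python sketch illustrates these stages. It assumes the common form of Dice's coefficient, S = 2C / (A + B), where C is the number of k-grams shared by both texts and A and B are the numbers of k-grams in each text; counting distinct k-grams (sets) is an assumption of this sketch, and the function names are hypothetical.

```python
def k_grams(text: str, k: int) -> list[str]:
    """Return every substring of length k, in order (empty if text is shorter than k)."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def dice_similarity(a: str, b: str, k: int = 4) -> float:
    """Dice's coefficient 2C / (A + B) over the sets of distinct k-grams."""
    grams_a, grams_b = set(k_grams(a, k)), set(k_grams(b, k))
    if not grams_a and not grams_b:
        return 0.0  # nothing to compare
    common = len(grams_a & grams_b)
    return 2 * common / (len(grams_a) + len(grams_b))


print(k_grams("abcde", 4))
print(dice_similarity("abcde", "abcdf"))
```

Multiplying the coefficient by 100 gives the similarity percentage reported by the system.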

G. Similarity Value Percentage
The similarity percentage between documents is interpreted in five ways [5]:
• 0%: the two documents are completely different, both in content and in their sentences as a whole.
• < 15%: the two documents have little in common.
• 15-50%: the document contains moderate plagiarism.
• > 50%: the document can be said to contain plagiarism.
• 100%: the document is plagiarized, because its content is exactly the same from beginning to end.

IV. RESULT
At the start, the application selects one of the detection methods, namely detection using the title and the content, as in Table 1. The first process is preparation: the results of tokenizing, filtering and stemming are shown in Table 2. The second process, shown in Table 3, parses the text into k-grams of length K = 4. Each k-gram is then hashed by converting its characters to decimal ASCII codes, with K-gram = 4 and Modulo = 101; the result of this hashing calculation is shown in Table 4. The third process matches the resulting hash values. In the fourth process, the similarity level is obtained by weighting with Dice's Similarity Coefficient [10].
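The hashing and matching steps can be sketched in Python as follows. K = 4 and Modulo = 101 are taken from the text; the base of 256 and the function names `fingerprints` and `matching_fingerprints` are assumptions of this sketch, because the exact hash formula is not stated here.

```python
K = 4          # k-gram length used in the text
MODULUS = 101  # modulus used in the text
BASE = 256     # assumed base: one digit per 8-bit character code


def fingerprints(text: str, k: int = K) -> list[int]:
    """Hash every k-gram of text from the ASCII codes of its characters."""
    result = []
    for i in range(len(text) - k + 1):
        value = 0
        for ch in text[i:i + k]:
            value = (value * BASE + ord(ch)) % MODULUS
        result.append(value)
    return result


def matching_fingerprints(a: str, b: str) -> set[int]:
    """Return the hash values that occur in both texts."""
    return set(fingerprints(a)) & set(fingerprints(b))


print(fingerprints("plagiarism"))
print(matching_fingerprints("plagiarism", "plagiarize"))
```

Because different k-grams can share a hash value under a small modulus, a matching hash should still be confirmed by comparing the underlying k-grams before it is counted as a match.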

V. CONCLUSION
Based on the series of tests we have carried out, our system can provide an accurate similarity value for scientific papers by using k-gram parsing and hashing to find matches of the same words or phrases in the document being tested. The modified Rabin-Karp algorithm improves the processing time (running time) of the similarity calculation. The system is able to check the titles and abstracts of scientific papers, or comparable documents, accurately against the comparison documents in the database. Checking document similarity with the Rabin-Karp algorithm yields a similarity percentage and a detection notification.
The four categories of the comparison process are:
• From right to left
• From left to right
• In a specific order
• In any order
The key to the efficiency of the Rabin-Karp algorithm lies in the selection of its hash value. One well-known and effective way is to treat each substring as a number in a specific base. The hash function should provide at least the following properties [4]:
• It can be computed efficiently.
• It discriminates strongly between strings.
• Hash(s[i+1...i+m]) = Hash(s[i...i+m-1]) - Hash(s[i]) + Hash(s[i+m]) should be easy to compute from: a) Hash(s[i...i+m-1]) b) Hash(s[i]) c) Hash(s[i+m])
The Rabin-Karp algorithm has the following characteristics:
a) It applies a hash function.
b) The preprocessing phase takes Θ(m) time and constant space.
c) The search phase takes Θ(nm) time in the worst case.
d) The expected running time is Θ(n+m).

III. PROPOSED SYSTEM

Fig. 2. Results of parsing, hashing key and fingerprint against PDF documents in our system.

Fig. 3. Notice of plagiarism with color and result percentage of similarity in our system.
• Distance-based similarity measure, which measures the similarity of two objects in terms of the geometric distance between the variables contained in the two objects.
• Feature-based similarity measure, which calculates the level of similarity by representing the objects as the features to be compared. Feature-based similarity is widely used in classification and pattern matching for images and text.
• Probabilistic-based similarity measure, which calculates the level of similarity of two objects by representing the two sets of objects being compared in the form of probability. This includes Kullback-Leibler distance and posterior probability.

TABLE I

TABLE III. RESULTS OF K-GRAM PARSING

TABLE IV. CALCULATION RESULTS MODULO AND REMAINDER

Table 5 shows the result of matching the values found in Table 4 by string matching, keeping the entries whose match value is yes.