Ranking Attribution: A Novel Method for Stylometric Authorship Identification

Stylometric authorship attribution is one of the essential approaches in text mining. This research proposes a stylometric method called Stylometric Authorship Ranking Attribution (SARA), which addresses two common problems, processing time and prediction accuracy, without relying on the judgment of a domain expert. The method uses the attributes that have proved most effective in stylometric authorship prediction: counts of frequent words, whether single words, word pairs, or word trios. These attributes carry the most evidence of an author's writing style and therefore serve the proposed recognition and prediction technique. The experiments show that the proposed method produces superior prediction accuracy and, within the scope of the dataset, gives completely correct results at the final stage of the experimental tests.

Keywords: Data mining; text mining; Stylometric Authorship Attribution; SARA


I. INTRODUCTION
Data mining is the analysis of observational data sets to find relationships and to summarize the data in novel ways that are both understandable and useful to the data owner [1]. Text mining (TM) [2], also known as knowledge discovery in textual databases (KDT) [3] or text data mining [2], creates new and interesting knowledge; many also define it as the process of extracting previously unknown, understandable, potential, and practical patterns or knowledge from a collection of large, unstructured text data or corpora. Text mining uses the same analysis approaches and techniques as data mining. However, data mining requires structured data, whereas text mining aims to discover patterns in unstructured data [4]. The problem of text mining has gained increasing attention in recent years because of the large amounts of text data created by a variety of social network, web, and other information-centric applications. Unstructured data is the most natural form of information and can be produced in any application scenario. As a result, there is a great need for techniques and algorithms that can effectively process a wide variety of text applications [1]. Another major issue is the dependency on multilingual text refinement, which creates problems: only a few tools support multiple languages [5].
Text mining is generally composed of three steps: text preprocessing, text mining operations, and postprocessing. Text preprocessing tasks, including data selection, classification, and feature extraction, normally convert the documents into intermediate forms that are suitable for the particular mining purpose. Text mining operations are the core of a text mining system and include clustering, association rule discovery, trend analysis, pattern discovery, and other knowledge discovery algorithms. Postprocessing tasks manipulate the data or knowledge coming from the text mining operations, and include evaluation and selection of knowledge, interpretation, and visualization [6].
The following sections review recent methods and approaches in one subfield of text mining, concerned with literary text corpora and the writing style of their authors, before presenting the details of the proposed method.

II. LITERATURE REVIEW
An essential problem in authorship attribution is the choice of stylometric features that are linguistic expressions of individual authors. The sets of proposed features vary, depending on the available data, the intended generality of the extraction method, and applicability to specific languages.
The simplest features describe statistical properties of documents: word length, sentence length, and vocabulary richness. Function words are features based on word frequencies. In contrast to text categorization problems, where the most common words are considered useless or even harmful for classification, in authorship attribution problems they are often used as personal style markers. However, not all of the most common words are good candidates for inclusion in the feature set: an important characteristic is instability [7], i.e., the possibility of being replaced by another word from the dictionary. Other word-based features are word sequences (n-grams). An example of this approach can be found in [8], where classification using word sequences was tested on 350 poems in Spanish by five authors, giving about 83% accuracy.
Features that usually give very high accuracy are character n-grams, i.e., sequences of n characters extracted from the words appearing in documents. They are considered language independent, i.e., they can be extracted from texts in various languages regardless of the character sets used. See, for example, [9] for reports on authorship attribution of English, Greek, and Chinese texts. In our opinion, the very good results of their application should be treated with caution: there is an apparent functional dependence between document content and character n-grams, so they may constitute an alternative representation of function words (which is probably good) or they may simply reflect document content (which appears to be worse).
www.ijacsa.thesai.org
Tareef proposed a new stylometric approach known as Stylometric Authorship Balanced Attribution (SABA), which analyzes texts such as novels and plays by famous authors and tries to measure an author's style by selecting attributes that reveal it, assuming that each writer has a unique way of writing that no other writer shares. The approach offers higher prediction accuracy and is independent of human judgment, meaning that it does not rely on domain experts. The method is implemented by merging three methods: the computational approach, the Winnow algorithm, and the Burrows-Delta method. The algorithm is considered an unsupervised model, and it was tested successfully on English-language texts with notable prediction accuracy [10].

III. STYLOMETRIC AUTHORSHIP ATTRIBUTION
Stylometry is the study of writing style based on linguistic features and is typically applied to authorship attribution problems [11].
Stylometric authorship attribution (SAA) began as "content analysis," which was described as "understanding data not as a collection of physical events but as symbolic phenomena and to approach their analysis unobtrusively. Methods in the natural sciences do not need to be concerned with meanings, references, consequences, and intentions. Methods in social research that derive from these hard disciplines manage to ignore these phenomena for convenience." The term content analysis is about 50 years old; Webster's English Dictionary has listed it only since 1961 [12].

IV. STYLOMETRIC AUTHORSHIP BALANCED ATTRIBUTION (SABA)
The SABA method was compared against three other methods: the computational approach, the Winnow algorithm, and the Burrows-Delta method. The results showed that SABA produces the best prediction accuracy and even gives a completely correct result in the final stage of the test [10].
SABA works by disregarding the maximum values of the attribute frequencies and using "balanced" frequencies instead. The idea is that the right attributes are the "stabilized" or "balanced" ones rather than those with the maximum frequencies. For example, assuming passages of 10,000 words, if a particular writer uses a particular word between 200 and 250 times in all of his books, then that word is considered an attribute with a "stable" frequency percentage, even though it does not have the maximum frequency count [10].
V. BURROWS DELTA METHOD
While many methods have been applied to the problem of automated authorship attribution, John F. Burrows's "Delta method" [13] is a particularly simple yet effective one. The goal is to automatically determine, based on a set of known training documents labeled with their authors, who the most likely author is for an unlabeled test document. The Delta method uses the most frequent words in the training corpus as the features on which it bases these judgments. The Delta measure is defined as the mean of the absolute differences between the z-scores for a set of word-variables in a given text-group and the z-scores for the same set of word-variables in a target text [14].
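The Delta definition above can be illustrated with a minimal Python sketch. It is not the implementation used in this work (which relies on Microsoft Access and Visual C#); the function names `z_scores` and `burrows_delta`, the use of the population standard deviation, and the toy frequencies are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def z_scores(profile, corpus_profiles, words):
    """Z-score of each word's frequency in `profile`, relative to the corpus.

    `profile` and every entry of `corpus_profiles` map word -> relative frequency.
    Assumption: the population standard deviation is used, and a word with
    zero spread across the corpus gets a z-score of 0.
    """
    scores = {}
    for w in words:
        values = [p.get(w, 0.0) for p in corpus_profiles]
        mu = mean(values)
        sigma = pstdev(values)
        scores[w] = 0.0 if sigma == 0 else (profile.get(w, 0.0) - mu) / sigma
    return scores


def burrows_delta(group_profile, target_profile, corpus_profiles, words):
    """Mean absolute difference between the z-scores of a text group and a target text."""
    zg = z_scores(group_profile, corpus_profiles, words)
    zt = z_scores(target_profile, corpus_profiles, words)
    return mean(abs(zg[w] - zt[w]) for w in words)


# Toy example: two candidate authors and one unlabeled text (made-up frequencies).
author_a = {"the": 60.0, "of": 30.0}
author_b = {"the": 40.0, "of": 50.0}
corpus = [author_a, author_b]
unknown = {"the": 58.0, "of": 32.0}
words = ["the", "of"]

delta_a = burrows_delta(author_a, unknown, corpus, words)
delta_b = burrows_delta(author_b, unknown, corpus, words)
best = "A" if delta_a < delta_b else "B"  # the smaller Delta marks the likelier author
```

The candidate with the smallest Delta is taken as the most likely author, which is the usual reading of the measure.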

VI. METHODOLOGY
The data were taken from the website www.Gutenberg.org. The dataset is a good cross-section of nineteenth-century English literature as well as various other works. From this collection we selected 5 of the 100 most downloaded authors and collected 10 books from each: Charles Dickens, Jack London, William Shakespeare, Mark Twain, and Oscar Wilde.
Both algorithms (the Burrows Delta and SABA methods) share the same first steps. The chosen novels are loaded as plain text (with the .txt extension), and cleaning and chunking steps are performed (removing double spaces, punctuation marks, special characters, symbols, and the like). The text is then transformed into Microsoft Access 2010 database files, in which every record contains a frequent single word, word pair, or word trio.
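The cleaning and chunking steps can be sketched in Python as follows. This is only an illustration under assumptions: the experiments themselves store records in Microsoft Access, the helper names `tokenize` and `ngram_counts` are hypothetical, and keeping only lowercase ASCII letters (so that "don't" splits into two tokens) is a simplification that the source does not specify.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only runs of letters.

    Punctuation, digits, special characters, and repeated spaces are dropped.
    """
    return re.findall(r"[a-z]+", text.lower())


def ngram_counts(tokens, n):
    """Count word sequences of length n (1 = single words, 2 = pairs, 3 = trios)."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


tokens = tokenize("The cat, the cat;  the dog!")
singles = ngram_counts(tokens, 1)
pairs = ngram_counts(tokens, 2)
trios = ngram_counts(tokens, 3)
```

Each counter plays the role of one database table in which a record holds a single word, a word pair, or a word trio together with its count.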
All tests in this experiment were implemented using a Microsoft Access 2010 database and Visual C#. Ten books were chosen for each of the famous authors (Charles Dickens, Jack London, William Shakespeare, Mark Twain, and Oscar Wilde): nine for learning and one for testing.

A. Burrows Delta Method
Burrows's Delta is the mean of the absolute differences between the z-scores for a set of word-variables in a given text-group and the z-scores for the same set of word-variables in a target text. The working steps are carried out in detail for frequent single words, word pairs, and word trios. The first step is to transform the book under test from plain text (.txt) into a list of its individual words. The result is shown in Fig. 1. This operation is executed for all learning and testing books.
Next, identical records are grouped and their number of occurrences (redundancy) is counted, and the result is stored in a separate table, as shown in Fig. 2. The following step cancels out the differences in book size: the frequency of each attribute is divided by the total of the frequencies of all attributes and multiplied by 1000. This yields frequencies that carry equal weight across all books and give a true indication of the author's style. The result is shown in Fig. 3.
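The size normalization can be written as a short Python sketch. The function name `relative_frequencies` and the example counts are assumptions for illustration only.

```python
def relative_frequencies(counts, scale=1000):
    """Divide each attribute count by the total count and multiply by `scale`.

    `counts` maps attribute (word, pair, or trio) -> raw count in one book.
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    return {attr: count / total * scale for attr, count in counts.items()}


# A book of 4 tokens in which "the" occurs 3 times and "cat" once.
freqs = relative_frequencies({"the": 3, "cat": 1})
```

After this step a long book and a short book by the same author can be compared on the same scale.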
The next task is to build a stylometric map: all nine learning books of the author being tested are merged and assembled into a single table, a relationship is made between their fields, the arithmetic average of the redundancies is calculated, and the table is indexed by that average in descending order, as shown in the following steps:
1) Merge all the learning data and save the result in a single table.
2) Assemble the result of merging data from the previous table and save the result in a new single table.
3) Make a relationship between their fields, and calculate the arithmetic average of the redundancies.
Once the average has been calculated for all fields of the learning data and sorted in descending order, the stylometric map is ready for testing against other authors' books.
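The map-building steps above can be sketched in Python. This is a hedged illustration, not the Access queries used in the experiments: the function name `build_stylometric_map` is hypothetical, and treating an attribute that is absent from a learning book as having frequency 0 is an assumption that the text does not state.

```python
def build_stylometric_map(book_profiles):
    """Average each attribute's normalized frequency over the learning books.

    `book_profiles` is a list of dicts mapping attribute -> normalized frequency,
    one dict per learning book. Returns (attribute, average) pairs sorted by
    average in descending order; ties are broken alphabetically.
    """
    attributes = set().union(*book_profiles)
    n = len(book_profiles)
    averages = {
        attr: sum(profile.get(attr, 0.0) for profile in book_profiles) / n
        for attr in attributes
    }
    return sorted(averages.items(), key=lambda item: (-item[1], item[0]))


# Two toy learning books instead of nine, with made-up frequencies.
books = [{"the": 60.0, "of": 30.0}, {"the": 40.0, "and": 20.0}]
style_map = build_stylometric_map(books)
```

With nine real learning books per author, the head of the sorted list supplies the attributes used in the later correlation tests.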
The stylometric map is then examined and tested by building relationships between the map and the five test books, one per author, which gives a new distribution of attributes based on the extracted stylometric map.
In addition, this operation isolates the attributes that do not take part in any redundancy: attributes that the learning books and the test books do not have in common are set aside. This step is important because it makes the stylometric map stronger and a truer reflection of the author's style. The stylometric map is then sorted in descending order of the average percentage value of each attribute in the attribute set.
For Pearson, the last step selects the top 300 attributes, i.e., those with the highest average percentage values in the stylometric map. The Pearson correlation between the particular author's stylometric map and each of the five test books is then computed, giving five Pearson values. To weight each parameter, each Pearson value is multiplied by -1 if it corresponds to the wrong author for the already known outcome, or by +1 if it corresponds to the correct author.
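A minimal Python sketch of this scoring step is given below. The names `pearson` and `score_test_book` are hypothetical, the map is assumed to be a list of (attribute, average) pairs already sorted in descending order, and an attribute missing from the test book is assumed to have frequency 0. The +1/-1 weighting is left out because it depends on the known outcome.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: report 0 when one side has no variation
    return cov / sqrt(var_x * var_y)


def score_test_book(style_map, test_profile, top_n=300):
    """Correlate the top_n map averages with the same attributes in a test book."""
    top = style_map[:top_n]
    map_values = [average for _, average in top]
    test_values = [test_profile.get(attr, 0.0) for attr, _ in top]
    return pearson(map_values, test_values)


style_map = [("the", 50.0), ("of", 30.0), ("and", 10.0)]
same_style = {"the": 100.0, "of": 60.0, "and": 20.0}
other_style = {"the": 10.0, "of": 30.0, "and": 50.0}
r_same = score_test_book(style_map, same_style)
r_other = score_test_book(style_map, other_style)
```

Running one author's map against all five test books gives five coefficients, and the highest one identifies the predicted author.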
For Spearman, a new table is built that consists of 5 maps and 1 test. Each word is associated with its ratio and its rank (the rank is derived from the ratio). A search function then looks up each word of the test in each map: if the word is found, its rank in that map is taken; if not, the rank is replaced by zero. The result of this procedure is a table that contains only the test words, each with its rank in the test and its rank in each specific map. The next step applies the Spearman equation, whose value also ranges between -1 and +1.
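The lookup procedure can be sketched in Python as follows. The function name `rank_table` and the data layout are assumptions: each map and the test are represented as word lists already sorted by descending ratio, so that a word's rank is its 1-based position in the list.

```python
def rank_table(test_words, maps):
    """Build one row per test word with its rank in the test and in every map.

    `test_words` is the test book's word list, ordered by descending ratio.
    `maps` maps author name -> that author's word list in the same order.
    A word that is not found in a map gets rank 0, as described above.
    """
    lookups = {
        author: {word: rank for rank, word in enumerate(words, start=1)}
        for author, words in maps.items()
    }
    table = []
    for test_rank, word in enumerate(test_words, start=1):
        row = {"word": word, "test_rank": test_rank}
        for author, lookup in lookups.items():
            row[author] = lookup.get(word, 0)
        table.append(row)
    return table


maps = {"Dickens": ["the", "of", "and"], "London": ["and", "the"]}
table = rank_table(["the", "and", "wolf"], maps)
```

Each author column of the resulting table can then be correlated with the `test_rank` column.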
The Spearman correlation between two variables is equal to the Pearson correlation between the rank values of those two variables; whereas Pearson's correlation assesses linear relationships, Spearman's correlation assesses monotonic relationships.
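The relationship between the two coefficients can be checked with a small Python sketch that computes Spearman as the Pearson correlation of rank values. The function names are hypothetical, and giving tied values the average of their positions is a common convention assumed here rather than taken from the text.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: report 0 when one side has no variation
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank values."""
    return pearson(average_ranks(xs), average_ranks(ys))


xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]  # monotonic in xs, but not linear
rho = spearman(xs, ys)
r = pearson(xs, ys)
```

In this example the relationship is perfectly monotonic but not linear, so the Spearman value reaches 1 while the Pearson value stays below 1.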

B. SABA Method
The Stylometric Authorship Balanced Attribution (SABA) technique is considered a development of the Burrows Delta algorithm. It relies on the coefficient of variation (CV), a statistical measure that is not affected by the magnitude of the mean. The algorithm is analyzed and tested on English-language texts for frequent single words, word pairs, and word trios.
In the SABA technique, the tests for frequent words, pairs, and trios are applied in the same way as in the Burrows Delta method, with one fundamental difference: the top 300 attributes are selected according to their coefficient of variation (CV) values. The frequent-word case illustrates the main steps for extracting the CV and selecting the required attributes.
To apply the SABA technique, repeat all the previous steps in the same order as in the Burrows Delta method, then modify the final stylometric map to extract the mean, the standard deviation (SD), and the coefficient of variation (CV) of every attribute in the learning data. The CV is found by dividing the standard deviation by the mean. Finally, sort the data in ascending order of CV and select the top 300 attributes. After building relationships between the final stylometric map and the test books of all authors, as in the Burrows Delta test, the final frequent-word test of the SABA technique is obtained.
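The CV-based selection can be sketched in Python. The function name `select_balanced_attributes` is hypothetical, and two details are assumptions that the text leaves open: the population standard deviation is used, and an attribute absent from a learning book counts as frequency 0.

```python
from statistics import mean, pstdev


def select_balanced_attributes(book_profiles, top_n=300):
    """Return the top_n attributes with the lowest coefficient of variation.

    `book_profiles` is a list of dicts mapping attribute -> normalized frequency,
    one dict per learning book. Each result row is (attribute, mean, cv).
    """
    attributes = set().union(*book_profiles)
    rows = []
    for attr in attributes:
        values = [profile.get(attr, 0.0) for profile in book_profiles]
        mu = mean(values)
        if mu == 0:
            continue  # CV is undefined when the mean is zero
        rows.append((attr, mu, pstdev(values) / mu))
    # Ascending CV: the most stable ("balanced") attributes come first.
    rows.sort(key=lambda row: (row[2], row[0]))
    return rows[:top_n]


# Toy learning data: "the" is perfectly stable, "whale" varies the most.
books = [
    {"the": 50.0, "of": 30.0, "whale": 0.0},
    {"the": 50.0, "of": 10.0, "whale": 40.0},
]
selected = select_balanced_attributes(books, top_n=2)
```

Sorting in ascending order of CV favors attributes whose frequency is stable across an author's books, which matches the "balanced" idea behind SABA.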
For Pearson, to weight each parameter, multiply each Pearson value by -1 if it corresponds to the wrong author for the previously known outcome, or by +1 if it corresponds to the correct author.
For Spearman, if there are no repeated data values, a perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other. It is worth noting that Spearman takes less time to compute than Pearson and uses simpler, less complex numbers, because it works with ranks rather than frequencies.

A. Burrows Delta Method and Pearson
The first run of this test was carried out on three authors only; all predictions were correct, with a 0% error rate for frequent words, pairs, and trios alike.
After applying it to five authors, an error rate of 20% was found.

• Frequent word
Table I presents the final results for each author, showing the prediction accuracy for frequent words. The highlighted cells hold the highest coefficient value in each row, which indicates a fully correct prediction.

• Frequent pair
Table II presents the final results for each author, showing the prediction accuracy for word pairs. The highlighted cells hold the highest coefficient value in each row, which indicates a fully correct prediction.

• Trio word
Table III presents the final results for each author, showing the prediction accuracy for word trios. The highlighted cells hold the highest coefficient value in each row; here they do not indicate a fully correct prediction.

• Summary
The prediction results for frequent words and word pairs were better than those for trios: the trio-word results are less accurate, whereas the frequent-word and word-pair results contain no prediction errors. The experiment therefore showed that frequent words and word pairs give the highest prediction values and are the best attributes according to the true prediction values across all results. This test uses complex equations and numbers and takes more time than the Spearman and rank-based approach.

B. Burrows Delta Method and Spearman
The first run of this test was carried out on three authors only; all predictions were correct, with a 0% error rate for frequent words, pairs, and trios alike.

• Frequent word
Table IV presents the final results for each author, showing the prediction accuracy for frequent words. The highlighted cells hold the highest coefficient value in each row, which indicates a fully correct prediction.

• Frequent pair
Table V presents the final results for each author, showing the prediction accuracy for word pairs. The highlighted cells hold the highest coefficient value in each row, which indicates a fully correct prediction.

• Trio word
Table VI presents the final results for each author, showing the prediction accuracy for word trios. The highlighted cells hold the highest coefficient value in each row, which indicates a fully correct prediction.

• Summary
The prediction results for frequent words, pairs, and trios were the best possible, since none of the results contains any prediction error. The experiment showed that all tests gave perfect prediction values, so every attribute type performed equally well according to the true prediction values. Speed and accuracy were both high in this experiment: the Spearman equation is less complex than Pearson's equation and takes less time, and it works faster because only 300 attributes are taken from the test rather than all attribute values. The CV was dropped in favor of the ratio, and the numbers involved are simpler and less complex because ranks are used instead of frequencies. The experimental setup was also changed from 5 tests against 1 map to 5 maps against 1 test. It is worth mentioning that this experiment obtained perfect results.

C. SABA Method and Pearson

• Frequent word
Table VII presents the final results for each author, showing the prediction accuracy for frequent words. The highlighted cells hold the highest coefficient value in each row; here they do not indicate a fully correct prediction.

• Frequent pair
Table VIII presents the final results for each author, showing the prediction accuracy for word pairs. The highlighted cells hold the highest coefficient value in each row; here they do not indicate a fully correct prediction.

• Trio word
Table IX presents the final results for each author, showing the prediction accuracy for word trios. The highlighted cells hold the highest coefficient value in each row, which indicates a fully correct prediction.

• Summary
The prediction results for frequent words and word pairs were worse than those for trios: the trio-word results are more accurate, since they contain no prediction errors.

The first contribution is a gain in prediction accuracy, achieved by using the Pearson and Spearman correlations as the main weighting factor in the SABA and Burrows methods. Notably, using the Spearman algorithm, which is less complex than Pearson, with the Burrows algorithm led to optimal prediction results. The next contribution is an improved feature extraction process that introduces a new set of more dependable attributes, namely word pairs and word trios, in addition to the classical frequent words. The results showed that the Spearman correlation coefficient leads to zero prediction error with high speed and accuracy; the Spearman equation is less complex than the Pearson equation and takes less time. The main finding of this work is that results are best when the ratio is used rather than the CV, and that the numbers involved are simpler and less complicated because ranks are used instead of frequency matches. SARA produced optimal prediction results compared with SABA and Burrows. Replacing ratio values with attribute ranks makes the calculations easier and faster.

TABLE I. PEARSON CORRELATION COEFFICIENT RESULTS IN THE FREQUENT WORD FOR EACH STYLOMETRIC MAP AGAINST FIVE OTHER AUTHORS TEST BOOKS

Columns: Pearson in Dickens test; Pearson in Shakespeare test; Pearson in Wilde test; Pearson in London test

TABLE III. PEARSON CORRELATION COEFFICIENT RESULTS IN TRIO WORD FOR EACH STYLOMETRIC MAP AGAINST FIVE OTHER AUTHORS TEST BOOKS

TABLE IV. SPEARMAN CORRELATION COEFFICIENT RESULTS IN THE FREQUENT WORD FOR EACH STYLOMETRIC MAP AGAINST FIVE OTHER AUTHORS TEST BOOKS

TABLE V. SPEARMAN CORRELATION COEFFICIENT RESULTS IN THE FREQUENT PAIR FOR EACH STYLOMETRIC MAP AGAINST FIVE OTHER AUTHORS TEST BOOKS

TABLE VII. PEARSON CORRELATION COEFFICIENT RESULTS IN THE FREQUENT WORD FOR EACH STYLOMETRIC MAP AGAINST THREE OTHER AUTHORS TEST BOOKS

TABLE VIII. PEARSON CORRELATION COEFFICIENT RESULTS IN THE FREQUENT PAIR FOR EACH STYLOMETRIC MAP AGAINST THREE OTHER AUTHORS TEST BOOKS

TABLE IX. PEARSON CORRELATION COEFFICIENT RESULTS IN TRIO WORD FOR EACH STYLOMETRIC MAP AGAINST THREE OTHER AUTHORS TEST BOOKS