Comparative study of Authorship Identification Techniques for Cyber Forensics Analysis

Authorship identification techniques are used to identify the most likely author of online messages from a group of potential suspects and to find evidence supporting that conclusion. Cybercriminals misuse online communication to send blackmail or spam email and then attempt to hide their true identities to avoid detection. Authorship identification of online messages is a contemporary research issue for identity tracing in cyber forensics. It is a highly interdisciplinary area, drawing on machine learning, information retrieval, and natural language processing. This paper presents a study of recent techniques and automated approaches to attributing authorship of online messages. The focus of this review is to summarize the existing authorship identification techniques used in the literature to identify authors of online messages. It also discusses evaluation criteria and parameters for authorship attribution studies and lists open questions that will attract future work in this area.


INTRODUCTION
Cybercrime, also known as computer crime, is the use of a computer to further illegal ends, such as committing fraud, trafficking in child pornography and intellectual property, stealing identities, or violating privacy.
Cybercrime, especially through the Internet, has grown in importance as the computer has become central to commerce, entertainment, and government. Senders can hide their identities by forging the sender's address, by routing messages through an anonymous server, and by using multiple usernames to distribute online messages via different anonymous channels.
Authorship identification is useful for identifying the most plausible authors and for finding evidence to support the conclusion.
Authorship analysis problem is categorized as [
Although the authorship attribution problem has a long history, in the last few decades authorship attribution of online messages has become an emerging research area, as it lies at the confluence of machine learning, information retrieval, and natural language processing. The problem began as the basic task of identifying the author of anonymous texts (taken from Bacon, Marlowe and Shakespeare) [1] and has since grown to cover forensic analysis, electronic commerce, and other applications. This extended version of the authorship attribution problem is described as a needle-in-a-haystack problem in [2]. When authors write, they use certain words unconsciously, so it should be possible to find an underlying pattern in an author's style. The fundamental assumption of authorship attribution is that each author habitually uses specific words that make their writing unique. Extracting features from text that distinguish one author from another involves statistical or machine learning techniques.
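The word-usage assumption can be illustrated with a minimal sketch that turns a text into relative frequencies of a few common words. The word list, the tokenization rule, and the function name `word_profile` are illustrative assumptions and are not taken from the surveyed studies.

```python
import re
from collections import Counter

# Hypothetical, deliberately short word list; real studies use far larger feature sets.
FUNCTION_WORDS = ("the", "of", "and", "to", "in")

def word_profile(text, vocabulary=FUNCTION_WORDS):
    """Return the relative frequency of each vocabulary word in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in vocabulary]

profile = word_profile("The cat and the dog")
```

Profiles computed this way for texts of known authorship could then be passed to any of the statistical or machine learning techniques reviewed in the following sections.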
The rest of the paper is organized as follows. Section 2 reviews existing techniques used for authorship analysis along with their classification. Section 3 explains the basic procedure for authorship analysis. Section 4 summarizes comparisons of various techniques from 2006 to 2012. Section 5 reviews the performance evaluation parameters required for authorship analysis techniques, followed by Section 6, which is the conclusion.

II.
STATE OF THE ART OF CURRENT TECHNIQUES
This section gives the fundamental ideas behind existing authorship attribution techniques, followed by their comparison in the next section. In the literature, this problem has been solved using statistical analysis and machine learning techniques. These are mainly categorized as shown in Figure 1.
www.ijacsa.thesai.org

B) CUSUM statistics procedure:
In statistical analysis, CUSUM (the cumulative sum control chart) is a sequential analysis technique used for monitoring change detection. As its name implies, CUSUM involves the calculation of a cumulative sum.
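As a minimal sketch of the cumulative sum idea, the code below accumulates deviations of a measured quantity from a target value. The sentence-length figures are invented for illustration, and using the sample mean as the target is an assumption rather than a procedure prescribed by the surveyed work.

```python
def cusum(values, target):
    """Return the running cumulative sum of deviations from `target`."""
    total = 0.0
    sums = []
    for value in values:
        total += value - target
        sums.append(total)
    return sums

# Hypothetical average sentence lengths for consecutive blocks of a text.
sentence_lengths = [12, 11, 13, 12, 18, 19, 17, 18]
target = sum(sentence_lengths) / len(sentence_lengths)
chart = cusum(sentence_lengths, target)

# The turning point of the chart suggests where the underlying level changed.
change_index = chart.index(min(chart))
```

In this made-up series the chart falls while the values stay below the mean and rises afterwards, so the turning point marks a candidate change in writing habit.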

C) Cluster Analysis:
Cluster analysis is an exploratory data analysis tool for solving classification problems. Its purpose is to sort cases (people, things, events, etc.) into groups, or clusters, so that the degree of association is strong between members of the same cluster and weak between members of different clusters.
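One common way to form such groups is k-means clustering, sketched below. The choice of k-means, the two-dimensional feature vectors, and the starting centroids are all assumptions made for illustration; the text above does not commit to a particular clustering algorithm.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then update centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Recompute each centroid as the mean of its cluster; keep the old one if empty.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters

# Hypothetical (average word length, average sentence length) pairs for six messages.
points = [(4.1, 12.0), (4.3, 11.5), (4.0, 12.5),
          (5.6, 21.0), (5.8, 20.0), (5.5, 22.0)]
clusters = kmeans(points, centroids=[points[0], points[3]])
```

Messages that fall into the same cluster have similar feature values, which is the strong within-cluster association described above.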

A. Feed-forward neural network:
A feed-forward neural network is an artificial neural network in which connections between the units do not form a directed cycle. This distinguishes it from recurrent neural networks. The feed-forward neural network was the first and arguably simplest type of artificial neural network devised. In this network, information moves in only one direction, forward, from the input nodes, through the hidden nodes (if any), to the output nodes. There are no cycles or loops in the network.
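The one-directional flow of information can be sketched as a forward pass through a list of layers. The sigmoid activation, the layer sizes, and the weight values below are illustrative assumptions; no training step is shown.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """One fully connected layer: weighted sum per unit, then sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(inputs, layers):
    """Pass the inputs through each layer in order; nothing is fed back."""
    activations = inputs
    for weights, biases in layers:
        activations = layer(activations, weights, biases)
    return activations

# Hypothetical network: 2 inputs, 2 hidden units, 1 output, with made-up weights.
network = [
    ([[0.5, -0.5], [0.25, 0.75]], [0.0, 0.0]),  # hidden layer
    ([[1.0, -1.0]], [0.0]),                     # output layer
]
output = forward([1.0, 1.0], network)
```

Each layer reads only the activations of the layer before it, which is what rules out cycles.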

B. Radial basis function network:
A radial basis function network is an artificial neural network that uses radial basis functions as activation functions. The output of the network is a linear combination of radial basis functions of the inputs and neuron parameters.
Radial basis function networks are used for function approximation, time series prediction, and system control.
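The linear combination described above can be written out directly. The sketch below assumes Gaussian basis functions with a shared width parameter `beta`; the centers and weights are invented values that would normally be learned from data.

```python
import math

def rbf_output(x, centers, weights, beta=1.0):
    """Weighted sum of Gaussian radial basis functions centered at `centers`."""
    return sum(w * math.exp(-beta * math.dist(x, c) ** 2)
               for w, c in zip(weights, centers))

# Hypothetical centers and output weights.
centers = [(0.0, 0.0), (10.0, 10.0)]
weights = [2.0, -1.0]
y = rbf_output((0.0, 0.0), centers, weights)
```

An input lying on a center fully activates that basis function, while distant centers contribute almost nothing, so here `y` is approximately the first weight.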

C. Support Vector Machines:
In machine learning, support vector machines (SVMs, also called support vector networks) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, and are used for classification and regression analysis. The basic SVM takes a set of input data and predicts, for each given input, which of two possible classes forms the output, making it a non-probabilistic binary linear classifier.
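The binary linear decision made by a trained SVM can be sketched as the sign of a weighted sum. The weight vector `w` and bias `b` below are made-up values standing in for the result of training, which is not shown; the labels +1 and -1 are a conventional encoding of the two classes.

```python
def svm_predict(x, w, b):
    """Return +1 or -1 depending on which side of the hyperplane w.x + b = 0 the input lies."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical separating hyperplane in a two-feature space.
w, b = (1.0, -1.0), 0.0
label_a = svm_predict((2.0, 1.0), w, b)
label_b = svm_predict((1.0, 2.0), w, b)
```

For an authorship problem with more than two candidate authors, several such binary classifiers would have to be combined, for example one per author.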

IV.
CLASSIC PROCEDURE FOR AUTHORSHIP IDENTIFICATION
Figure 2 shows the classic approach to modeling the authorship identification problem. The complexity of this problem is determined by parameters such as the number of authors and the size of the training set. Both parameters play a vital role in determining prediction accuracy. Although these parameters are considered critical to the complexity of the problem, and therefore to prediction accuracy, no studies have examined their impact on authorship identification performance in a systematic way. The problem of authorship attribution has been well explored for literature, newspapers, and similar texts, but limited work has been done on authorship identification of online messages such as blogs, emails, and chat. This comparative study concluded that performance degrades as the number of authors increases and the size of the training set decreases. Thus, taking these parameters into account, a further research direction is to improve prediction accuracy.