An Enhanced Malay Named Entity Recognition using Combination Approach for Crime Textual Data Analysis

Named Entity Recognition (NER) is one of the tasks in information extraction. NER is used for extracting and classifying words or entities that belong to the proper noun category in text data, such as a person's name, location, organization, date and others. Today, online sources such as web pages, blogs, Facebook, Twitter, Instagram and online newspapers are among the major contributors to the generation of information. This paper presents an enhanced Malay Named Entity Recognition model that combines fuzzy c-means clustering with the K-Nearest Neighbours algorithm for crime analysis. The results show that this combined method can improve the accuracy of entity recognition on crime data in Malay. The model is expected to provide a better method for recognizing named entities in text analysis, particularly in Malay.

Keywords: Named entity recognition; information extraction; fuzzy c-means; k-nearest neighbors; Malay language; crime data


I. INTRODUCTION
Information is one of the important resources in human life, and its volume is rising along with technology. Information of various types, such as text, images, audio and video, is continuously generated on the internet, and most of it is unstructured. This growing amount of information affects people's daily lives in work, learning and lifestyle. Effective management and organization of information is a key strategy for addressing the problem of finding useful information. Appropriate techniques and methods are therefore necessary to process and extract the essential knowledge contained in this information.
Therefore, this paper presents Malay named entity recognition using clustering and classification methods. The rest of this paper is organized as follows. Section 2 discusses related work on the named entity recognition task. Section 3 presents techniques and machine learning algorithms for NER. Section 4 discusses Malay NER, followed by the proposed approach in Section 5. The experimental results and discussion are elaborated in Section 6. Finally, Section 7 covers the conclusion.

II. RELATED WORK
Named Entity Recognition (NER) is important in analyzing crime reports to address the problem of crime, since crime reports are written in different languages in each country. When a lot of information related to crime occurrences is available on the web with many specific entities, many NER techniques can be used to extract useful information for better crime analysis and follow-up action, as explained by Hosseinkhani, Koochakzaei, and Keikhaee [1]. Shabat and Omar [2] implemented NER using an ensemble framework that focuses on designing models to extract specific criminal information from the Web. Their main goal is to integrate the set of features and classification algorithms in an orderly way to synthesize more precise classification procedures. Three base classifiers, namely Naïve Bayes, Support Vector Machine and K-Nearest Neighbor, are used for each of the feature sets, and these three classifiers are combined using a weighted voting ensemble method.
Alkaff and Mohd [3] analyzed online news, blogs and social networking sites using gazetteers and rule-based extraction for named entity recognition to identify crime hot spots. Therefore, an accurate natural language processing technique needs to be explored to capture and recognize named entities within open-domain textual data effectively.
Named entity recognition requires several steps to achieve the objectives of the research. These steps include the pre-processing stage, the annotation stage, and the evaluation or system development stage. According to Jurafsky and Martin [4], there are some basic steps in the statistical sequence labelling approach to creating a named entity recognition system. Fig. 1 illustrates these steps.

Many methods and techniques are continuously being developed that focus on managing information and knowledge. Earlier knowledge management focused strongly on just keeping large amounts of data for data mining. Now, the growing use of the Internet and the information burden place a huge demand on managing information intelligently, efficiently and effectively. The application of artificial intelligence methods and research in the growing area of human-machine interaction provides grounds for further investigation.

A. Rule-based Approaches
In computer science, rule-based systems are used as a means of storing and manipulating knowledge to interpret information in a useful way. They are often used in artificial intelligence applications and research. Normally, the term rule-based system is used for systems involving a set of human-crafted rules. Today, rule-based systems are widely implemented for many kinds of problems and tasks. In text analysis that focuses on the NER task, the rule-based approach recognizes named entities by defining rules about the position of entity members in the phrase or sentence. The constraint in implementing this method lies in the capability of pattern definition, which is usually done by a linguist. Rule-based NER is also highly dependent on the language used.
In general, a NER system using a rule-based approach has a Part-of-Speech (POS) tagger, sentence or phrase syntax, and orthographic features such as word capitalization patterns, combined with a data dictionary. Eftimov et al. [4] state that the rule-based NER method uses regular expressions that combine information from the source terminology and characteristics of the entities of interest. The main drawback of this method is the manual construction of rules, which is time-consuming and domain-dependent. Eftimov et al. [4] combined terminology-driven NER with rule-based NER in their proposed method, called DrNER, for extracting knowledge for evidence-based dietary recommendations. The basic structure of the rule-based expert system is shown in Fig. 2.
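To make the rule-based idea concrete, the following is a minimal sketch that combines a capitalization pattern with a small dictionary. The gazetteer entries, the title words and the function name `rule_based_ner` are illustrative assumptions; they are not the rules of any system cited above.

```python
import re

# Hypothetical dictionary and title words (assumptions for illustration only).
LOCATION_GAZETTEER = {"Kuala Lumpur", "Johor Bahru", "Melaka"}
PERSON_TRIGGERS = {"Encik", "Puan", "Datuk", "Inspektor"}

# One or more consecutive capitalised words, e.g. "Kuala Lumpur".
CAPITALISED_RUN = re.compile(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*")


def rule_based_ner(sentence):
    """Return (text, label) pairs found by two simple hand-written rules."""
    entities = []
    for match in CAPITALISED_RUN.finditer(sentence):
        words = match.group().split()
        if words[0] in PERSON_TRIGGERS and len(words) > 1:
            # Rule 1: a title followed by capitalised words is a person name.
            entities.append((" ".join(words[1:]), "PERSON"))
        elif match.group() in LOCATION_GAZETTEER:
            # Rule 2: a capitalised run listed in the dictionary is a location.
            entities.append((match.group(), "LOCATION"))
    return entities
```

A sentence-initial capitalised word such as "Polis" matches the pattern but fires neither rule, which illustrates why such rules need a dictionary and why they must be maintained by hand.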

B. Learning-based Approaches
• Supervised Learning

The ability to recognize previously unknown entities is an essential part of a NER solution. Early studies were mostly based on supervised learning (SL). A supervised learning algorithm builds a model of the relationships and dependencies between input features and the target output, so that output values for new data can be predicted from the relationships learned from previous datasets. Kotsiantis [5] stated that supervised machine learning is an algorithm that generates a general hypothesis from externally supplied examples and then uses it to make predictions about future instances. In other words, the purpose of this learning is to build a concise model of the distribution of class labels in terms of predictor features.
Morwal [6] and Chopra and Morwal [7] use Hidden Markov Models in named entity recognition, while Ahmed and Sathyaraj [8] applied maximum entropy to recognize entity sets such as name, location and organization from a given text. The different variants of SL techniques tag the words of a test corpus based on a defined corpus, which requires a large set of heuristic rules and clusters.

• Unsupervised Learning

One of the learning-based approaches for pattern recognition is unsupervised learning (USL). Unsupervised learning is an artificial intelligence (AI) approach that partitions a dataset using unlabelled or unclassified information, where the partitioning is based on hidden features contained in the data. The algorithm acts on this data without guidance and without prior training. It can arrange information based on similarities and differences even though no categories are provided for the data. Sathya and Abraham [9] stated that an unsupervised learning model recognises information based on heuristic patterns, whereas reinforcement learning learns through trial-and-error interactions with its surroundings (rewards and penalties).
Unsupervised learning is also used in named entity recognition tasks, as one of the approaches to solving the problems encountered in named entity recognition. Li et al. [10] presented TwiNER, an unsupervised NER system for targeted tweet streams on Twitter that requires no explicit human labelling effort. The system does not depend on unreliable local linguistic features. Furthermore, S. Zhang and Elhadad [11] proposed an unsupervised approach to NER in the biomedical field that extracts named entities from biomedical text. This unsupervised approach was conducted in three main steps: seed term collection, boundary detection and entity classification.

• Semi-supervised Learning

Semi-supervised learning is a technique that combines supervised and unsupervised learning. A variety of semi-supervised learning methods try to generate high-quality training data automatically from an unlabelled corpus. Semi-supervised learning can produce a considerable improvement in learning accuracy. This improvement can help in the structured process of extracting named entities such as location, person, type of crime and other entities involved in a crime situation more accurately from unstructured data such as email messages, word processing documents and web blogs.
However, traditional semi-supervised learning methods still rely on high-quality labelled entities to learn the context of unlabelled textual data. Fuzzy semi-supervised clustering offers a new opportunity to improve on classical methods and crisp semi-supervised hierarchical clustering. However, fuzzy semi-supervised clustering is still a new subject, and few studies in the literature relate it to named entity recognition. Diaz-Valenzuela, Vila, and Martin-Bautista [12] use a fuzzy semi-supervised clustering approach to classify scientific publications in digital web libraries. They use the concepts of fuzzy must-link and fuzzy cannot-link constraints to identify the optimum α-cut of a dendrogram.
Castellano, Fanelli, and Torsello [13] use a semi-supervised fuzzy clustering algorithm to group shapes into clusters. Each cluster is represented by a prototype that is manually labelled and used to annotate the shapes belonging to that cluster. To capture the evolution of the image set over time, the previously discovered prototypes are added as pre-labelled objects to the current shape set and semi-supervised clustering is applied again. Both of these recent studies improve the accuracy of the clusters under the supervision of a limited amount of labelled data.

IV. MALAY NER
This section gives an overview of the Malay language based on aspects related to this scope. The Malay language is one of the languages that has attracted researchers' interest in implementing the named entity recognition task, which focuses on the identification of proper nouns in Malay. Like other languages, Malay has its own characteristics in presenting information, based on the order of sentences and the forms of words that carry certain meanings. The discussion of named entity recognition in the Malay language covers orthography, morphology, structure, and so on.
Alfred, Chin Leong, Kim On, and Anthony [14] explain that, as one of the processes in text mining, named entity recognition is very useful for information extraction because it helps users identify and detect entities such as person, location and organization. They also argue that different NER processes need to be applied to different languages due to morphological differences. They therefore proposed a rule-based named entity recognition algorithm for Malay articles based on Malay part-of-speech (POS) tagging features and contextual features. The rule-based NER algorithm identifies named entities using a set of rules and manually specified dictionary lists. Due to the lack of annotated corpora for the Malay language that can be used as training data, they used rule-based methods rather than machine learning to identify the three major named entity types: person, organization and location. The rules were built on POS-tagging contexts. The F-measure obtained in their NER experiment was 89.47%.

Furthermore, another experiment was conducted by Sulaiman et al. [15] to detect Malay named entities. The Stanford NER and Illinois NER tools were applied to online news articles to measure their capability in identifying Malay entities. The experimental comparison found that Stanford NER tends to yield higher F1 and precision than Illinois NER. Both tools were developed using machine learning methods. They conclude that, to improve the named entity task in Malay, most Malay NER systems use rule-based methods. After conducting the experiments, they found that both NER tools showed low detection results on the Malay corpus because there were many errors in identifying entities. This is because of the morphological differences between Malay and English.

Besides that, Salleh, Asmai, Basiron, and Ahmad [16] applied the conditional random fields method in developing an automated Malay Named Entity Recognition (AMNER) conceptual model to recognize entities for the Malay language. Current approaches for Malay NER mostly use a set of rules and lists of dictionaries set by humans to identify entities. These rules extract entities such as location, organization and others based on their basic patterns. Because of this limitation, the libraries or dictionaries used must always be kept up to date to recognize named entities. Malay language features are the main factor in their model development, serving as guidance for the named entity recognition process. Several aspects of Malay language writing are described as follows.

A. Orthography
One of the aspects involved in named entity recognition is the conventional spelling system of a language, called orthography. The Malay language has its own orthography in its spelling structure. Cho [17] explains that at present the Latin alphabet, introduced by Western linguists, is used for the orthography and spelling system of the Malay and Indonesian languages. Besides that, Zaidi, Rozan, and Mikami [18] stated that standard Malay is written with the 26-letter alphabet known as Rumi, which makes it compatible with communication technology and gives it the potential to use only text-based features for communicating in Malay. Orthography in Malay includes norms of spelling, hyphenation, emphasis, punctuation, capitalization and word breaks.

B. Morphology
Morphology is also used in named entity recognition research. In linguistics, morphology is the study of the inner structure of words and of word formation, and it forms an essential part of today's linguistic study. It describes how words are formed and how they relate to other words in the same language. Words are broken down into smaller meaningful parts, and the smallest meaningful part of a word is called a morpheme. The word structures and parts of words analyzed by morphology include stems, prefixes, suffixes and root words. In addition, morphology looks at parts of speech, the way context can change a word's pronunciation and meaning, and the intonation and stress within a word.
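As a rough illustration of splitting a word into prefix, stem and suffix, the sketch below strips one common Malay affix from each end of a word. The affix lists, the `min_stem` threshold and the function name `segment` are assumptions made for this example; a naive stripper like this over-segments many words (including proper nouns) and is not the morphological analysis used in this research.

```python
# Common Malay prefixes and suffixes, longest first so that "pem" wins over "pe".
PREFIXES = sorted(
    ["meng", "meny", "mem", "men", "me", "peng", "peny", "pem", "pen", "pe",
     "ber", "ter", "di", "ke", "se"],
    key=len, reverse=True,
)
SUFFIXES = sorted(["kan", "nya", "lah", "kah", "an", "i"], key=len, reverse=True)


def segment(word, min_stem=4):
    """Return (prefix, stem, suffix), stripping at most one affix per side."""
    word = word.lower()
    prefix = suffix = ""
    for candidate in PREFIXES:
        # Only strip when enough characters remain to form a plausible stem.
        if word.startswith(candidate) and len(word) - len(candidate) >= min_stem:
            prefix, word = candidate, word[len(candidate):]
            break
    for candidate in SUFFIXES:
        if word.endswith(candidate) and len(word) - len(candidate) >= min_stem:
            suffix, word = candidate, word[:-len(candidate)]
            break
    return prefix, word, suffix
```

For example, `segment("pembunuhan")` separates the stem "bunuh" from the prefix "pem" and the suffix "an".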

V. A MALAY NAMED ENTITY RECOGNITION APPROACH
The research is conducted through five phases, represented in the form of a research design. Each phase in the research design is investigated intensively and then used to facilitate the next phase. Phase One begins with data acquisition, where the data are obtained as unstructured web pages. Phase Two is data pre-processing, followed by Phase Three, which focuses on feature extraction. The Malay NER model is then developed in Phase Four. Finally, the accuracy of the entity recognition is evaluated in Phase Five. Fig. 3 illustrates the design of the proposed Malay Named Entity Recognition (MNER) approach.

A. Data Acquisition
Based on the research design in Fig. 3, data acquisition is conducted in Phase One. Data are obtained from the Malay Crime News PDRM website in the form of web pages. These web pages contain elements such as URL links, images and texts that need to be processed because they are in unstructured form. The page contents are extracted to obtain the required information, producing unlabeled PDRM news texts.

B. Pre-processing Data
Pre-processing involves four tasks. In Phase Two, the documents, which contain much unstructured data, need to be delimited into meaningful units by performing tokenization, value tabulation, POS tagging and annotation. After the annotation process, the data were divided into two parts: training data and testing data. Fig. 4 shows the pre-processing workflow.

• Tokenization

The text data files (.txt), which contained unstructured sentences and paragraphs, were tokenized. Tokenization is the process of separating a text into meaningful elements such as words, phrases, symbols or digits, called tokens. The tokens were presented in a list as the input for further processing.
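A minimal sketch of such a tokenizer is given below, assuming a simple regular-expression rule; the pattern and the function name `tokenize` are illustrative choices, not the tool used in this research.

```python
import re

# A token is a word (optionally hyphenated, as in the Malay reduplication
# "kanak-kanak"), a number, or any single non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def tokenize(text):
    """Split raw text into a list of word, digit and symbol tokens."""
    return TOKEN_PATTERN.findall(text)
```

Keeping punctuation as separate tokens means that each symbol later receives its own row and tag in the tabulated file.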

• Tabulation Values

Next, the token text file was processed to store the data in a tab-separated structure, like spreadsheet data. The file was divided into three columns, namely the token, the part-of-speech (POS) tag and the named entity tag. Before continuing to the annotation stage, the entity tag column was set to the default value "O", meaning outside or other.
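The three-column layout can be sketched as follows. The placeholder `"_"` for a not-yet-assigned POS tag and the function names `tabulate` and `to_tsv` are assumptions for illustration; only the default entity tag "O" comes from the description above.

```python
def tabulate(tokens, pos_placeholder="_", default_tag="O"):
    """Create one (token, POS tag, entity tag) row per token.

    The POS column holds a placeholder until tagging is done, and the
    entity column starts with the default tag "O" (outside / other).
    """
    return [(token, pos_placeholder, default_tag) for token in tokens]


def to_tsv(rows):
    """Serialise rows as tab-separated lines, one token per line."""
    return "\n".join("\t".join(row) for row in rows)
```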

• POS Tagging

Every token in the file was also annotated with a POS tag such as CC, CD, NN, VB and others. The Penn Treebank POS tagset is described in Table 1.

• Annotation

Then, the file was annotated with entity types. Five types of entities are handled in this research: person name, location, organization, date, and type of crime, labelled as PERSON, LOCATION, ORGANIZATION, DATE and CRIME TYPE. Non-entity tokens are labelled as OTHER. The final pre-processed dataset, together with its extracted features, is shown in Fig. 6 and Fig. 7.

C. Features Extraction
In Phase Three, features are extracted for the named entity recognition task. Feature extraction is divided into two parts: in the first part, some features are extracted for use in the clustering process, and in the second part, other features are extracted for use in the classification process. The feature dataset generated in this phase is used for further analysis. The features selected for both parts are those appropriate for recognizing named entities in the Malay language. The details of the feature extraction process are shown in Fig. 5.
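The exact feature set is given in the figures rather than in the text, so the sketch below only illustrates the general shape of token-level feature extraction. All feature names here (capitalisation, digit check, length, POS tag, previous token) are assumptions chosen for illustration, not the features used in this research.

```python
def token_features(rows, i):
    """Build a feature dictionary for the token at position i.

    rows is a list of (token, pos_tag, entity_tag) tuples, such as the
    three-column file produced during pre-processing.
    """
    token, pos, _ = rows[i]
    # Use a sentinel when there is no previous token.
    prev_token = rows[i - 1][0].lower() if i > 0 else "<START>"
    return {
        "is_capitalised": token[:1].isupper(),
        "is_digit": token.isdigit(),
        "length": len(token),
        "pos": pos,
        "prev_token": prev_token,
        "is_first_token": i == 0,
    }
```

Before clustering or k-NN classification, categorical values such as the POS tag would still need to be encoded as numbers, since both methods operate on distances in a vector space.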

D. Malay NER Model Development
Furthermore, in Phase Four, two types of learning are used, namely clustering and classification. Fuzzy C-Means is used as the clustering method to cluster the data as either entity or non-entity. After that, the correctly clustered entities are labelled with more detailed entity types, which are person, location, organization, date, and type of crime. These entities then go through the classification process using the K-Nearest Neighbors algorithm.

The research proposes the fuzzy c-means method applied to the Malay named entity recognition task. The experiment is conducted by analysing the data that have completed the pre-processing stage. The data, which consist of feature sets, are processed using a clustering method called the fuzzy c-means (FCM) algorithm. Fuzzy clustering is categorized as an unsupervised learning method that is influential for data analysis and model construction. Sakinah [19] stated that the FCM algorithm begins with the desired number of clusters and an initial guess for each membership grade. Therefore, for each cluster, all data points have their respective membership grades. The goal of the algorithm is to guide the cluster centers to the optimum location in the data space by iteratively updating the membership grades along with the prototypes (cluster centers). Suganya and Shanthi [20] stated that fuzzy c-means uses fuzzy partitioning to allow a data point to belong to all groups with different membership grades between 0 and 1. They explain that the fuzzy c-means algorithm works by assigning each data point a membership to each cluster center. The membership value is calculated from the distance between the cluster center and the data point, and it increases the closer the data point is to the given cluster center. Fuzzy c-means clusters the data by iteratively searching for a set of fuzzy clusters and the associated cluster centers that represent the data structure.

The number K of nearest neighbors is specified in advance; the value that achieves high classification precision relies heavily on the data set used. As the most basic instance-based method, the k-NN algorithm represents the data in a vector space. The simple k-nearest-neighbor algorithm has two steps: first, find the K training examples that are closest to the unknown example; second, pick the most commonly occurring classification among these K examples. Fig. 9 shows the pseudo code of the k-nearest-neighbors algorithm.

Based on the clustering prediction chart in Fig. 11 and the cluster result in Fig. 13, the overall clustering accuracy was markedly good at 88.51%, calculated from the recall and precision results of all entity classes. This accuracy was evaluated on 17527 data samples that had been pre-processed and had undergone feature extraction. The precision for the NON_ENTITY class is 88.43% with 94.18% recall, whereas the precision for the ENTITY class is 88.68% with 78.74% recall. Based on the analysis with other languages including English, NER has been implemented in the Malay language, which shares characteristics with English in named entity recognition, such as the capitalisation feature.
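The two-step k-nearest-neighbors procedure described in this section (find the K closest training examples, then take the most frequent class among them) can be sketched as follows. This is a minimal illustration that assumes Euclidean distance and numeric feature vectors; the function name `knn_predict` and the default `k=3` are assumptions, not the configuration used in the experiment.

```python
import math
from collections import Counter


def knn_predict(train_vectors, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training vectors."""
    # Step 1: find the k training examples closest to the query.
    distances = sorted(
        (math.dist(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # Step 2: pick the class that occurs most often among those k examples.
    return Counter(nearest_labels).most_common(1)[0][0]
```

Because the whole training set is scanned for every query, the choice of K and the scaling of the features both influence the result, which matches the remark that precision relies heavily on the data set used.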
For the k-NN classification chart and result in Fig. 12 and Fig. 14 respectively, the predictions for the classes ORGANIZATION, LOCATION, DATE, CRIME, PERSON and OTHER are evaluated by precision and recall. For the ORGANIZATION entity, the precision is 89.77% and the recall is 90.99%. For the LOCATION entity, the precision is 80.83% and the recall is 72.39%. The DATE entity achieves 81.72% precision and 87.36% recall. For the CRIME type entity, both precision and recall are 85.48%. For the PERSON entity, the precision is 89.27% and the recall is 89.96%. Lastly, for the OTHER class, the precision and recall are 97.88% and 98.09% respectively.
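For reference, per-class precision and recall of the kind reported here are computed from true positives, false positives and false negatives. The sketch below shows the standard definitions on lists of gold and predicted labels; the function name and the toy labels in the usage note are illustrative and are not the experimental data.

```python
def precision_recall(gold, predicted, label):
    """Return (precision, recall) for one class label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)  # correct hits
    fp = sum(1 for g, p in pairs if g != label and p == label)  # false alarms
    fn = sum(1 for g, p in pairs if g == label and p != label)  # misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For instance, with gold labels `["PERSON", "PERSON", "OTHER"]` and predictions `["PERSON", "OTHER", "OTHER"]`, the PERSON class has one true positive, no false positives and one false negative.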

VII. CONCLUSIONS
In conclusion, the overall accuracy produced for the Malay NER analysis is 95.24% during k-NN classification. This accuracy, which gives an overall perspective of the evaluation process, could be improved in a further experiment by increasing the training dataset. This is because the improvement in accuracy for recognizing Malay entities depends on the trained model and on suitable feature sets. The model generated from the small dataset used during training affected the test results. Therefore, a bigger dataset is needed to develop the Malay model and improve the results. Significantly, the produced NER model can help to extract text data by determining the exact text or term in the Malay language as a named entity for further police investigation.
In addition, the selection of appropriate features needs continued attention, as these features can affect the performance of the NER model, especially for the Malay language, whose sentences have a complex structure.
The proposed Malay NER model can be further improved by increasing the corpus references in Malay to resolve ambiguities in recognizing named entity types in Malay texts.

Fig. 1. Basic steps approach for NER

III. TECHNIQUES AND MACHINE LEARNING ALGORITHMS FOR NER

Fig. 7. Sample of Feature Extraction k-NN Classification

The fuzzy c-means method (developed by Dunn in 1973 and improved by Bezdek in 1981) is frequently used in pattern recognition. The following Fig. 8 is the algorithm for fuzzy C-Means clustering.
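The fuzzy c-means iteration described in Section V-D (start from initial membership grades, then alternately update the cluster centers and the memberships) can be sketched as follows. This is a minimal pure-Python illustration of the standard update rules, not the implementation used in this research; the fuzzifier `m=2.0`, the stopping tolerance, the random initialisation and the function name are all assumptions.

```python
import math
import random


def fuzzy_c_means(points, n_clusters=2, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Return (centers, memberships) for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Initial guess: random membership grades, normalised to sum to 1 per point.
    memberships = []
    for _ in points:
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        memberships.append([value / total for value in row])

    dim = len(points[0])
    centers = []
    for _ in range(n_iter):
        # Center update: mean of all points weighted by membership ** m.
        centers = []
        for j in range(n_clusters):
            weights = [row[j] ** m for row in memberships]
            total = sum(weights)
            centers.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)
            ))
        # Membership update: closer centers receive higher grades in [0, 1].
        new_memberships = []
        for p in points:
            dists = [max(math.dist(p, c), 1e-12) for c in centers]
            new_memberships.append([
                1.0 / sum((dists[j] / dists[k]) ** (2.0 / (m - 1.0))
                          for k in range(n_clusters))
                for j in range(n_clusters)
            ])
        change = max(abs(a - b)
                     for old, new in zip(memberships, new_memberships)
                     for a, b in zip(old, new))
        memberships = new_memberships
        if change < tol:  # stop once the membership grades have stabilised
            break
    return centers, memberships
```

With two clusters, each point's pair of membership grades can be read as its degree of belonging to the entity and non-entity groups; taking the larger grade gives a crisp assignment.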

Fig. 8. Fuzzy C-Means Clustering Algorithm

• K-Nearest Neighbors Algorithm

Classification is a machine learning technique in the supervised learning category that can be used to develop a model that describes the classification of important data. The development of the classifier is based on the class attributes involved. Another method used in this experiment, for classification, is the K-nearest neighbors algorithm. In pattern recognition, k-nearest neighbors (k-NN) is one of the

Fig. 9. Pseudo code of k Nearest Neighbors algorithm

VI. RESULT & DISCUSSION

The collection of data is produced from PDRM news web pages in the Malay language, covering a few categories such as general topics, sports, crimes and others. Examples of the dataset before pre-processing are shown in Fig. 10, and after pre-processing in Fig. 6 and Fig. 7.

Fig. 10. Example of the dataset before pre-processing phase

TABLE I. THE PENN TREEBANK PART-OF-SPEECH TAG SET