Legal Entity Extraction: An Experimental Study of NER Approach for Legal Documents

In the legal domain, Named Entity Recognition (NER) serves as the basis for subsequent stages of legal artificial intelligence. In this paper, the authors develop a dataset for training NER models in the Indian legal domain. As a first step of the research methodology, a study is carried out to identify and establish legal entities beyond the commonly used named entities such as person, organization, and location. Annotators can use these entities to annotate different types of legal documents. A variety of text annotation tools exist, and finding the best one is difficult, so the authors experimented with several tools before settling on the most suitable one for this work. The annotations produced from unstructured text are stored in JavaScript Object Notation (JSON) format, which improves readability and simplifies data manipulation. After annotation, the resulting dataset contains approximately 30 documents and approximately 5000 sentences. This data is then used to train spaCy pre-trained pipelines to predict legal named entities. The accuracy of legal named entity recognition can be increased further if the pre-trained models are fine-tuned on legal texts.


I. INTRODUCTION
Artificial Intelligence (AI) has the potential to improve both the efficiency and accessibility of numerous legal processes [1]. In the current digital era, online document collections are growing rapidly, and technology and automation can help to extract information from them. As the amount of data continuously increases, it becomes more and more necessary to access and process these data, and natural language processing (NLP) plays a significant role. NER, one of the fundamental building blocks of NLP, can be used to develop AI applications in the legal domain [2]. Named entity recognition is the process of locating named entities in unstructured text and classifying them into predefined categories.
Named entity recognition identifies rigid designators in text that belong to well-known semantic classes such as person, place, and organization. NER is used not only as a standalone tool for information extraction (IE) [3], but also as a crucial pre-processing step in a variety of NLP applications such as text understanding, information retrieval, automatic text summarization, question answering, machine translation, and knowledge base construction [4].
To achieve high performance in NER, large amounts of knowledge in the form of feature engineering and lexicons have traditionally been required [5]. Machine learning and deep learning algorithms have since brought great advances in natural language processing, and more specifically in named entity recognition and information extraction [6]. Depending on the problem, such methods typically require a large set of manually annotated data [5], whereas some machine learning algorithms rely on unsupervised techniques that do not. Active learning-based clustering, a semi-supervised technique, has also been used to reduce manual annotation time [7].
Annotation is the practice of adding linguistic and interpretive information to an electronic corpus of spoken or written language data. In essence, annotation means adding a note to the input data. Annotation of words and characters is common for identifying biomedical entities such as genes, proteins, and diseases [8]. In previous work, Jackson M. Steinkamp and Abhinav Sharma annotated unstructured clinical notes to identify symptoms within electronic health records. Another study on medical named entity recognition prepared its dataset by annotating the notes of pneumonia patients [2]. Annotation between two words or phrases is also performed to mark syntactic dependencies or to identify the relation between two words in a sentence. A new annotation project, started from scratch, typically includes a variety of activities: defining annotation schemas [9], developing annotation guidelines, defining entity types, assembling appropriate collections of documents, pre-processing those documents, and creating the final corpus [10].
One of the important tasks in annotation is selecting an appropriate annotation tool, given the large number of tools available and the lack of an up-to-date list of annotation tools with their respective pros and cons [11]. An extensive review of available tools is therefore needed to avoid a poor choice. A weak decision can waste time on installing a tool and converting documents to its specific format.
In this work, an extensive review of tools for manual annotation of documents is presented, and the basic requirements for selecting an annotation tool are defined. To gain a better understanding of named entity recognition, an ER-system for the legal domain has been created. The first step in creating a corpus of annotated judgment papers is to define the relevant entities, which can mainly be categorized into two groups: domain-specific named entities such as legal terms, Acts, and legal institutes, and general named entities such as person, location, and date. An ER-system built with a spaCy pre-trained model is then presented.
www.ijacsa.thesai.org
The following contributions are made in this paper:
• An extensive review was conducted of manual annotation tools for creating NER training corpora.
• A corpus for legal named entity recognition was created, consisting of approximately 5000 judgment sentences with 8 legal named entities.
• An Entity-Relation model was developed based on a spaCy pre-trained model.

II. LITERATURE REVIEW
NER is regarded as a crucial activity in the information extraction process. Numerous studies on named entity recognition have been conducted, and several datasets have been released over the years. CoNLL'03 [12], which was drawn from English and German news text and is regarded as a language-independent NER dataset, is one of the best known. Datasets from many other fields, including medicine, law, and archaeology, were proposed afterwards [13].
Early NER systems relied on rules that were created by humans. Because a rule-based NER system takes a long time to design, researchers turned to NER systems based on machine learning, using supervised, semi-supervised, and unsupervised learning techniques. Alex Brandsen et al. [9] used a machine learning approach to predict named entities in Dutch excavation reports. Beyond classical machine learning, NER systems based on deep learning and neural networks are also being studied. Franck Dernoncourt et al. [14] performed NER using an artificial neural network (ANN) and obtained satisfactory results. Thomas A. F. Green et al. [15] presented a benchmark CRF-based entity recognition model on a manually created corpus of job descriptions and achieved an accuracy of approximately 60-65%.
While much NER research has been carried out using deep learning and machine learning approaches, very few studies have used pre-trained models such as BERT [16]. Mugisha et al. [2] published a detailed comparison of neural language modelling pipelines for predicting outcomes from the medical text notes of patients with pneumonia. Li, Jianfu, et al. [17] fine-tuned pre-trained contextual language models to support the NER task on clinical trial eligibility criteria, systematically exploring four pre-trained contextual embedding models for the biomedical domain (BioBERT, BlueBERT, PubMedBERT, and SciBERT).

III. RESEARCH METHODOLOGY
A systematic review of the literature on annotation tools used for NER was conducted, following the standard systematic literature review (SLR) procedures described by Chitu Okoli and Kira Schabram [20]. This methodology provides a detailed examination of NER systems across multiple domains, as well as of manual annotation tools and annotation methods.

A. Research Question
One of the crucial steps in a systematic review is formulating the research questions. To maintain focus from the start of the study, research questions (RQ) were written to guide the review procedure. Table II lists the research questions.

1) Choose keyword: Table III lists the keywords used to search for papers in the online libraries.
2) Inclusion and exclusion criteria: Only conference and journal papers published in English within the last five years were considered, and any papers currently under review were excluded. Table IV shows the inclusion and exclusion criteria.

IV. REVIEW OF TEXT ANNOTATION TOOLS
This section first gives a formal description of the text annotation problem and of annotation tools, followed by a detailed discussion of the selection criteria for annotation tools. The commonly used annotation tools are then introduced.

A. Annotation Tools
Data annotation tools are software used to create high-quality annotated machine learning training data from text, images, and videos. A wide variety of annotation tools exist, ranging from open-source tools that developers can modify to freeware applications that are free to use. The following subsection discusses what text annotation is, why it is needed, and some of its types.
B. Text Annotation: Needs and Type
1) Labelling procedures: Adding a label means attaching to a word or sentence a tag that describes its type. Labels can express emotions, technical terms, and so on. For instance, the phrase "I am satisfied with this product, it is amazing" could be given a label like "happy".
2) Adding metadata: Similarly, relevant information can be added to the statement "Mahadevapura police have filed charge sheet against the accused alleging that he has committed an offence punishable under Section 354C of I.P.C R/w sec.66(E) of Information Technology Act" to help the learning algorithm prioritise and concentrate on particular terms. One might write something like, "Mahadevapura (Location) police have filed charge sheet against the accused alleging that he has committed an offence punishable under Section 354C (Legal) of I.P.C. (Act) R/w sec.66(E) of Information Technology Act (Act)".
3) Types of data annotation: Some of the types of data annotation are briefly discussed below.
a) Sentiment annotation: Sentiment annotation is the assignment of labels to feelings such as sadness, happiness, anger, positivity, negativity, and neutrality. Any activity involving sentiment analysis can benefit from sentiment annotation. (For example, in retail, facial expressions can be used to assess customer satisfaction.)
b) Intent annotation: Intent annotation also labels sentences but emphasises the purpose or motivation behind the statement. In a customer service setting, for example, a message such as "I need to talk to Sam" may route the call directly to Sam, while a message such as "I have a problem with the credit card" could route it to the team handling credit card issues.
c) Named entity recognition: The goal of named entity recognition (NER) is to find and categorise special expressions or predefined named entities in a sentence [21]. It identifies words according to what they denote, such as names of people or places. NER supports information extraction as well as information classification and categorization.
d) Semantic annotation: Also known as meaningful annotation, semantic annotation is the addition of metadata, supplementary data, or tags to text that contains concepts and entities, such as persons, places, or themes.
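As a minimal sketch of how the inline labels in the metadata example above could be stored in machine-readable form, the following Python snippet converts (surface string, label) pairs into character-offset spans. The function name `locate_spans` is hypothetical, and the left-to-right search is a simplifying assumption; annotation tools normally record these offsets directly.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label)


def locate_spans(text: str, mentions: List[Tuple[str, str]]) -> List[Span]:
    """Turn (surface string, label) pairs into character-offset spans.

    Mentions are searched left to right, so they must be listed in the
    order in which they occur in the text.
    """
    spans: List[Span] = []
    cursor = 0
    for surface, label in mentions:
        start = text.find(surface, cursor)
        if start == -1:
            raise ValueError(f"mention not found: {surface!r}")
        end = start + len(surface)
        spans.append((start, end, label))
        cursor = end
    return spans


sentence = (
    "Mahadevapura police have filed charge sheet against the accused "
    "alleging that he has committed an offence punishable under "
    "Section 354C of I.P.C R/w sec.66(E) of Information Technology Act"
)

# Labels taken from the example in the text above.
mentions = [
    ("Mahadevapura", "Location"),
    ("Section 354C", "Legal"),
    ("I.P.C", "Act"),
    ("Information Technology Act", "Act"),
]

spans = locate_spans(sentence, mentions)
for start, end, label in spans:
    print(f"{sentence[start:end]!r:32} {label:9} [{start}:{end}]")
```

Storing offsets rather than inline tags keeps the original sentence unchanged, which matters when the same text is annotated by more than one person.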

C. Selection of Tools
Tools that are known and have been mentioned in previous studies were listed. Google, Google Scholar, Scopus, and other online databases were searched for tools mentioned in annotation tool-related publications. A wide range of annotation tools is available, but for this survey the selected tools are those that are most widely used across domains and that meet the criteria.
Several requirements for an annotation tool have been studied and are presented here. In this research, a total of 22 criteria are considered for evaluating annotation tools, divided into groups: input-output, publication, system, and function. Within these categories, important features of the tools such as accessibility, usability, and cost are considered. All of these categories are listed in Table V along with the associated criteria.
The input-output (data) criteria address the input and output formats of documents, the annotation schema, and the input formats for multimedia files. Publication criteria include the year of the last publication, the number of citations, and the number of publications in the last five years. System criteria cover installation architecture and ease of installation, quality and quantity of documentation, tool license, and OS support. The last set, functional criteria, covers multimedia annotation support, support for languages other than English, automatic text annotation, pre-annotation support, and data security. The features with which the tools must comply are listed below:
• It should be freely available.
• It should be a web application that can be downloaded or used online.
• It should be easy to install.
• It should be approachable.
• It should support multiple file formats and export annotations in multiple formats.
To satisfy the availability criterion, a tool must be readily accessible at the time of writing, either for direct online use (via a web user interface) or for download, without requiring users to get in touch with the developers. Availability also depends on whether the tool is free or licensed.
The tool must be a web application, meaning that it is either readily accessible online or can be downloaded and installed as a web application. Requiring the tool to be web-based ensures that annotators can focus completely on their annotation tasks without having to struggle with tool installation. Manual annotation is labor-intensive and difficult in itself, and additional work may frustrate the annotators and jeopardise the annotation process.
The tool must also function properly, which is a prerequisite for practical experiments. A minimal set of features, as defined by the criteria in Table V, should be available regardless of whether the tool is installed locally or used online. The tool should be simple enough to use, or its documentation thorough enough, that there is no need to contact the developers for help.
Beyond the functionalities listed above, a few additional features that make the annotation process easier are considered in this research: the smallest unit of annotation (character or token), built-in domain-specific named entity extraction, and aids to quick annotation such as keyboard shortcuts, pre-annotations, or ontologies.
Some additional features that are not compulsory for annotation tools, but that may be useful for most NLP-based tasks, are listed below:
• Support for multimedia.
• Support for multiple languages.
• Integration with AI models for automatic annotation.
• A good and simple user interface.

D. Selected Tools
In this section, the eight tools studied and selected for this research work are listed in Table VI, and each selected tool is discussed in detail with respect to its features.

1) BRAT (Browser-based rapid annotation tool):
BRAT is one of the most popular tools for manually annotating documents and has been employed in the creation of numerous corpora. It is a free, browser-based tool for collaborative text annotation [22]. BRAT is not offered as a hosted online service and must be installed locally. Documents are imported as plain text files, and the annotation schema is likewise configured in plain text files. BRAT was designed for rich structured annotation for a range of NLP tasks, and was created to support manual curation efforts and boost annotator productivity using NLP techniques. Entities and relations can be highlighted, and data can be normalized to pre-established terminologies. Its rich feature set includes integration with external resources such as Wikipedia, support for automatic text annotation tools, and built-in annotation comparison.
BRAT is better suited to annotating expressions and the relationships between them, because annotating longer texts such as paragraphs is inconvenient. It accepts only text files as input documents, and the user interface does not present them in their original formatting. Although the last version was released in 2012, the tool is still readily available and popular. Its most recent upgrades include, among other things, integration with external TM tools and the embedding of visualizations in HTML pages.

a) Doccano:
Doccano is an open-source, web-based annotation tool for text files only [23]. It supports a variety of task types, such as sequence labelling and text classification, which can be applied to problems such as annotating text for sentiment analysis, text summarization, and NER [9]. It has a more modern and attractive user interface, and all configuration is done in the web user interface. It also generates a basic overview of tagging statistics. All of this makes Doccano beginner-friendly and user-friendly in general. It supports multiple users, but offers no additional features for collaborative annotation.
b) GATE: GATE Teamware is a freely available, web-based, open-source tool for collaborative annotation and curation [24]. It is an extension of GATE, an annotation management tool. GATE Teamware offers automatic annotation, which reduces the manual annotation effort. Its interface can be used to create a corpus, define annotation schemas, and load pre-annotated data. As a collaborative tool, it allows users to monitor the annotation process, i.e., the number of documents annotated and the number remaining. It can also be used to monitor statistics such as the time spent on a document and inter-annotator agreement.
c) LightTag: LightTag [25] is another browser-based text labelling tool; however, it is not completely free. No local installation is required. It offers a free edition with 5,000 annotations per month for its essential features. It supports working with different languages (such as Arabic, Hebrew, and CJK, among others) and document-level, multi-word, nested, and relationship annotations. Additionally, it uses machine learning to learn from active annotators and suggest possible annotations for unseen text. It assigns tasks to annotators and ensures that there is enough overlap and duplication to maintain a high degree of accuracy and consistency.
d) Prodigy: Prodigy is a paid tool, and the only free version is a demo. It is an active learning-based annotation tool that is integrated with the spaCy library. Its active learning feature allows annotators to label only the cases for which the model does not yet have an answer, greatly accelerating the annotation process. By using transfer learning and a more flexible approach to data gathering, models of production quality can be trained with a minimal number of samples. Prodigy supports the annotation of images, videos, and audio in addition to text. Annotations can be exported in the JSONL, JSON, and txt formats.
e) UBIAI: UBIAI is a labelling platform for training and deploying custom NLP models; it falls in the data-labelling-as-a-service category of the technology stack [26]. It offers free and paid plans, OCR annotation, document classification, auto-tagging, team collaboration, and more. PDFs are widely used in the corporate world to convey important information but are difficult to edit, which makes PDF support essential for businesses and organizations that need to create high-quality annotations on PDFs. With UBIAI, native PDF documents, scanned images, invoices, and contracts can be annotated in over 20 languages, including Japanese, Spanish, Arabic, Russian, and Hebrew. Named entity recognition (NER), relationship extraction, and document classification can be performed in the same interface. Annotations can be exported in multiple formats, including spaCy, IOB, and Amazon Comprehend, and supported input formats include native PDF, TXT, CSV, PNG, JPG, HTML, DOCX, and JSON. It also offers team management features for tracking progress, measuring text annotations, monitoring the performance of assigned projects, and measuring agreement among annotators.
f) Label Studio: Label Studio is an open-source data labelling tool, written in Python, for labelling and exploring a variety of data types. It works with several data formats. Label Studio can also be integrated with machine learning models to provide label predictions (pre-labels) or to perform continuous active learning. It is also available in Enterprise and Cloud versions with additional features. Its simplicity lies in the absence of complicated configuration and in its ease of integration into machine learning pipelines. Label Studio can be used in different settings, depending on the use case. It is quickly configurable for many data types and is ready to use within a few minutes. It is easy to switch between labelling text, audio, and images, or even to annotate all three types at the same time. Many existing labelling frameworks accept only one data type, and it becomes tedious to learn a new application each time, whereas Label Studio works with text, images, audio, HTML documents, and any imaginable combination of annotation tasks such as classification, regression, tagging, spanning, pairwise comparison, object detection, and segmentation. After the labelling interface has been configured, data can be imported. The web import supports multiple formats: JSON, CSV, TSV, and archives consisting of those.
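To illustrate the JSON import mentioned above, the following sketch builds a list of tasks in which each sentence sits under a "data" key, the task shape commonly used for Label Studio JSON imports. It assumes that the labelling configuration reads the text from a field named `text`; the function name `build_tasks`, the extra fields `source` and `sentence_id`, and the sample sentences are hypothetical.

```python
import json
from typing import Dict, List


def build_tasks(sentences: List[str], source: str) -> List[Dict]:
    """Wrap each sentence in a task dict with its text under "data"."""
    return [
        {"data": {"text": sentence, "source": source, "sentence_id": index}}
        for index, sentence in enumerate(sentences)
    ]


# Illustrative sentences only; real input would be judgment sentences.
sentences = [
    "The appeal was heard by the High Court.",
    "The accused was charged under Section 354C of I.P.C.",
]

tasks = build_tasks(sentences, source="example_judgment")

# Serialise to a JSON string; writing it to a file gives an importable
# task file.
payload = json.dumps(tasks, ensure_ascii=False, indent=2)
print(payload)
```

Keeping a sentence identifier in each task makes it possible to trace an annotated sentence back to its position in the source judgment.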

A. Annotation of Dataset
This section describes the basis for dataset annotation, including the annotation setup, the entity types, and the annotation method (annotation guidelines).

1) Selecting suitable input documents for annotation:
To construct a robust dataset for legal named entity recognition, a diverse range of case documents was collected from the Indian Supreme Court and several High Courts throughout India. The documents were sourced from publicly available repositories on the web, including the official websites of these courts and prominent legal databases such as https://www.indiankanoon.org, as well as other legal repositories. Through this long and complex data collection process, care was taken that the dataset represents a wide range of court cases from various jurisdictions accurately and thoroughly.
2) Annotation setup: The open-source data labelling tool Label Studio was used for annotation. After comparison with the other tools (as discussed above), it was found to be the most straightforward, user-friendly, and effective tool for this experimentation. Label Studio can be installed in several ways, including with pip, with Docker, and from source, either locally or in the cloud. Its main prerequisite is Python 3.6 or later on a machine running Linux, Windows, or macOS. By default, Label Studio expects port 8080 to be open. Installation also needs SQLite 3.35 or later and PostgreSQL 11.5 or above. For this work, Label Studio was installed with pip and launched on a local machine, after which the data were uploaded and the entity types were defined in the tool.
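The version requirements above can be checked before installation. The following is a minimal sketch, using only the Python standard library, that compares the running interpreter and its bundled SQLite library against the minimum versions stated in the text; the helper names `meets_minimum` and `preflight` are hypothetical, and PostgreSQL and the port are not checked here.

```python
import sqlite3
import sys
from typing import Dict, Tuple

MIN_PYTHON = (3, 6)   # minimum Python version stated in the text
MIN_SQLITE = (3, 35)  # minimum SQLite version stated in the text


def meets_minimum(actual: Tuple[int, ...], required: Tuple[int, ...]) -> bool:
    """Compare version tuples on as many components as `required` has."""
    return tuple(actual[: len(required)]) >= tuple(required)


def preflight() -> Dict[str, bool]:
    """Report whether the local Python and SQLite versions are recent enough."""
    return {
        "python": meets_minimum(sys.version_info[:3], MIN_PYTHON),
        "sqlite": meets_minimum(sqlite3.sqlite_version_info, MIN_SQLITE),
    }


report = preflight()
for name, ok in report.items():
    print(f"{name}: {'ok' if ok else 'too old'}")
```

If both checks pass, installation with pip is typically done with `pip install label-studio`, after which `label-studio start` launches the local server.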
3) Entity type: The targeted entities are listed in Table VII, along with a brief description and an example for each category. The entity types were established after consulting legal experts on the pertinent information that can be gleaned from court rulings. Fig. 1 shows a sample document to be annotated, in which the highlighted parts of the text indicate the named entities to be annotated. The named entities that can be extracted from this text are given in Table VIII.

4) Manual annotation process:
The judgment text was annotated at the sentence level: each judgment sentence was presented to the annotator separately, without document-level context. In case extra background information was required for annotation, the whole judgment text was also available, obtained via its Indian Kanoon URL.
The open-source data labelling tool Label Studio was used to label and annotate the data. Fig. 2 illustrates named entity recognition using a machine learning algorithm and manual annotation. Once the data have been imported, labelling proceeds as follows:
a) Open a project in Label Studio.
b) Click Label All Tasks to start labelling.
c) Use keyboard shortcuts or the mouse to label the data and submit the annotations.
d) Follow the project instructions for labelling and for deciding whether to skip tasks.
e) Click the project name to return to the data manager.
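Once annotations have been submitted, they can be exported as JSON and converted for model training. The sketch below assumes an export in which every task carries its text under `data.text` and its labelled spans under `annotations[].result[].value`, which is the shape commonly produced by Label Studio's JSON export for span labelling. It converts each task into a `(text, {"entities": [...]})` pair, a layout often used for spaCy training data. The function name `to_training_examples` and the sample export are hypothetical.

```python
from typing import Dict, List, Tuple

Entity = Tuple[int, int, str]  # (start offset, end offset, label)
Example = Tuple[str, Dict[str, List[Entity]]]


def to_training_examples(tasks: List[dict]) -> List[Example]:
    """Convert exported annotation tasks into (text, {"entities": ...}) pairs."""
    examples: List[Example] = []
    for task in tasks:
        text = task["data"]["text"]
        entities: List[Entity] = []
        # Use only the first annotation of each task; merging several
        # annotators' work would need an explicit adjudication step.
        for annotation in task.get("annotations", [])[:1]:
            for item in annotation.get("result", []):
                if item.get("type") != "labels":
                    continue  # skip relations and other result types
                value = item["value"]
                for label in value["labels"]:
                    entities.append((value["start"], value["end"], label))
        examples.append((text, {"entities": sorted(entities)}))
    return examples


# Hypothetical export with one task and two labelled spans.
exported = [
    {
        "data": {"text": "Section 354C of I.P.C"},
        "annotations": [
            {
                "result": [
                    {
                        "type": "labels",
                        "value": {"start": 16, "end": 21,
                                  "text": "I.P.C", "labels": ["Act"]},
                    },
                    {
                        "type": "labels",
                        "value": {"start": 0, "end": 12,
                                  "text": "Section 354C", "labels": ["Legal"]},
                    },
                ]
            }
        ],
    }
]

examples = to_training_examples(exported)
print(examples)
```

Sorting the entities by offset gives a stable order regardless of the order in which the annotator marked the spans.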

5) Annotated corpus statistics:
In this paper, a dataset of almost 5000 Indian judicial judgment sentences annotated with seven entities has been created. Table IX lists the number of documents, sentences, and tokens in the annotated corpus, as well as other general statistics.

B. NER Model
Several well-known NER model architectures were explored to identify legal named entities in judgment papers. Initially, spaCy's pre-trained NER models were used to implement Legal NER. Two pre-trained pipelines, en_core_web_trf (spaCy) and en_core_sci_sm (distributed with scispaCy), were combined with custom rules created specifically for the legal domain to improve the accuracy of predictions.
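The paper does not list the domain-specific rules, so the following is only an illustrative sketch of what a small rule layer for legal entities could look like. It uses plain regular expressions as a stand-in for a pipeline rule component; the patterns, the labels `Legal` and `Act` (borrowed from the annotation example earlier in the paper), and the function name `apply_rules` are assumptions rather than the authors' actual rules.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label)

# Hypothetical rules: statute sections, named Acts, and the I.P.C abbreviation.
RULES = [
    ("Legal", re.compile(
        r"\b(?:section|sec\.)\s*\d+[A-Za-z]*(?:\([A-Za-z0-9]+\))?",
        re.IGNORECASE)),
    ("Act", re.compile(r"\b(?:[A-Z][a-z]+\s+)+Act\b")),
    ("Act", re.compile(r"\bI\.P\.C\b\.?")),
]


def apply_rules(text: str) -> List[Span]:
    """Return non-overlapping rule matches, preferring earlier and longer spans."""
    candidates: List[Span] = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            candidates.append((match.start(), match.end(), label))
    # Sort by start offset, then by descending length.
    candidates.sort(key=lambda span: (span[0], -(span[1] - span[0])))
    kept: List[Span] = []
    last_end = 0
    for start, end, label in candidates:
        if start >= last_end:  # drop anything overlapping a kept span
            kept.append((start, end, label))
            last_end = end
    return kept


text = ("an offence punishable under Section 354C of I.P.C "
        "R/w sec.66(E) of Information Technology Act")
matches = apply_rules(text)
for start, end, label in matches:
    print(f"{label:6} {text[start:end]}")
```

In a combined system, spans found by such rules would be merged with the statistical model's predictions, with a policy for resolving conflicts between the two.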
During the training phase, the model's predictions were iteratively compared to the reference annotations to calculate the gradient of the loss, as shown in Fig. 3. Backpropagation was then used to compute the gradient of the weights from the gradient of the loss. This indicated how to adjust the weight values so that the model's predictions gradually approached the reference labels, thereby improving the model's accuracy.
To ensure that the Legal NER model was optimized for legal named entity identification, a strict and systematic methodology was followed. The model is capable of identifying many different types of legal entities, such as court names and legal terms.

C. Results
Several evaluation metrics, namely precision, recall, and the F1 score, were used to evaluate the model's performance. These metrics provide important information about how well the model identifies and categorizes entities. Precision is the proportion of the entities predicted by the model that are correct, recall is the proportion of the reference entities that the model correctly identified, and the F1 score is the harmonic mean of precision and recall:

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (1)$$

where Precision = TP / (TP + FP) and Recall = TP / (TP + FN), with TP, FP, and FN denoting true positives, false positives, and false negatives. Using several measurements gives a better understanding of the model's strengths and weaknesses, which helps in deciding how to improve its performance.
Fig. 4 compares the performance of the model on individual entities. Because the dataset is highly imbalanced, precision, recall, and F1 score were calculated for the comparison. The weighted averages of precision, recall, and F1 score are 0.60, 0.41, and 0.48 respectively for the spaCy en_core_web_trf pipeline, and 0.51, 0.40, and 0.45 respectively for en_core_sci_sm. Encouraging results have been obtained from the experiments and evaluations, and the Legal NER model can be a valuable tool for a variety of legal applications such as legal information retrieval, document summarization, and more.
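The per-entity and weighted-average scores described above can be computed as in the following sketch. It assumes exact-match evaluation, in which a predicted span counts as correct only if its start, end, and label all match a reference span, and it weights each label by its number of reference spans; the paper does not state its matching criterion, so this is an assumption. The function names `prf` and `evaluate` and the toy spans are hypothetical.

```python
from collections import Counter
from typing import Dict, Set, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label)
Scores = Tuple[float, float, float]  # (precision, recall, F1)


def prf(tp: int, fp: int, fn: int) -> Scores:
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def evaluate(gold: Set[Span], pred: Set[Span]) -> Tuple[Dict[str, Scores], Scores]:
    """Per-label scores plus averages weighted by the number of gold spans."""
    labels = sorted({span[2] for span in gold | pred})
    support = Counter(span[2] for span in gold)
    per_label: Dict[str, Scores] = {}
    for label in labels:
        g = {span for span in gold if span[2] == label}
        p = {span for span in pred if span[2] == label}
        per_label[label] = prf(len(g & p), len(p - g), len(g - p))
    total = sum(support.values())
    if total == 0:
        return per_label, (0.0, 0.0, 0.0)
    weighted = tuple(
        sum(per_label[label][i] * support[label] for label in labels) / total
        for i in range(3)
    )
    return per_label, weighted


# Toy reference and predicted spans for one sentence.
gold = {(0, 12, "Location"), (28, 40, "Legal"), (44, 49, "Act"), (60, 86, "Act")}
pred = {(0, 12, "Location"), (28, 40, "Legal"), (44, 49, "Act"), (90, 95, "Act")}

per_label, weighted = evaluate(gold, pred)
for label, (p, r, f) in per_label.items():
    print(f"{label:9} P={p:.2f} R={r:.2f} F1={f:.2f}")
print("weighted  P={:.2f} R={:.2f} F1={:.2f}".format(*weighted))
```

Weighting by support means that frequent entity types dominate the average, which is why per-entity scores are worth reporting alongside it for an imbalanced dataset.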

VI. DISCUSSION
In the legal domain, NER is typically used for tasks such as document classification, contract analysis, and case law research [27]. The accuracy of NER is crucial in the legal domain, as incorrect recognition of entities can lead to incorrect legal decisions [28].
NER faces several challenges in the legal domain compared to other domains [29]. Firstly, the language used in legal documents is often complex and technical, which can make entities difficult to identify with traditional NER models. Secondly, legal named entities can have multiple forms and variations, such as acronyms, abbreviations, and synonyms, requiring NER systems to have a comprehensive understanding of legal terminology. To address this issue, NER models in the legal domain are frequently fine-tuned on large annotated legal corpora, which can increase the accuracy of legal entity recognition [30].
Another challenge for NER in the legal domain is the presence of named entities with several mentions, such as the names of legal parties. These entities may be referred to by multiple names or titles in different places in the document, making proper identification difficult. To overcome this problem, NER models in the legal sector typically include named entity disambiguation approaches, which help to identify and resolve ambiguity in named entities.
Despite these challenges, NER has proven to be a valuable tool in the legal domain. By automating the process of identifying named entities [31], NER can significantly reduce the time and effort required for legal research and analysis. This can result in increased efficiency and productivity for legal professionals, as well as improved accuracy and consistency in the analysis of legal data. Overall, NER in the legal domain is a critical tool for facilitating legal research, analysis, and decision-making. With advances in machine learning and NLP techniques [32], NER models in the legal domain are becoming more accurate and efficient, helping to make the legal process faster and more effective.
Named entity recognition has great potential to improve the process of legal research and analysis, but it faces significant challenges in the legal domain due to the complexity and technical nature of legal language [33]. Further development and refinement of NER systems for the legal domain will likely result in even greater benefits for legal professionals in the future. Once these entities have been extracted and tagged, they can be used for research and analysis of legal texts. Furthermore, policy-making can be informed by the knowledge gained through Legal NER. Overall, the use of Legal NER in legal research and text analysis can enhance legal research, inform policy decisions, and result in more efficient and fair legal systems.

VII. CONCLUSION AND FUTURE WORK
In this paper, a corpus of Indian judgment papers is presented that is annotated with 7 distinct types of entities and can be used to identify legal named entities. To create the annotated dataset, a variety of annotation tools were reviewed, and 30 publicly available court documents were manually annotated. With the dataset, spaCy models were also trained using the pre-trained NER pipelines en_core_sci_sm and en_core_web_trf. The best pipeline, en_core_web_trf, reaches a weighted precision of 0.60 and a weighted F1-score of 0.48, suggesting that the dataset is of reasonable quality. It is believed that the dataset will be useful for additional NLP tasks on Indian judicial material, such as relationship extraction, knowledge graph modelling, and extractive summarization.
In terms of future work, the authors will explore approaches for extending and further optimizing the dataset. They will also perform additional experiments with more recent state-of-the-art approaches. The authors also plan to produce a CSV version of the dataset, which will simplify the data format, enhance compatibility, facilitate data pre-processing, and enable data analysis.