Personally Identifiable Information (PII) Detection in the Unstructured Large Text Corpus using Natural Language Processing and Unsupervised Learning Technique

Poornima Kulkarni; Cauvery N K

doi:10.14569/IJACSA.2021.0120957

International Journal of Advanced Computer Science and Applications (IJACSA), Volume 12 Issue 9, 2021.

Abstract: Personally Identifiable Information (PII) has gained much attention with the rapid development of technologies and the exploitation of information relating to individuals. Corporations and other organizations store large amounts of information, much of it disseminated as emails that include personal information about users, employees, and customers. The security aspects of PII storage have been ignored, raising serious concerns about individual privacy, along with concerns about understanding the responsibilities attached to the use of PII. In real-time scenarios, however, email data is unstructured text, and detecting PII in such a large unstructured text corpus is quite challenging. This paper presents an intelligent clustering approach for automatically detecting PII in a large text corpus. The aim of the proposed study is to design a model that receives text content and detects possible PII attributes. To that end, the paper presents a clustering-based PII Model (C-PPIM) built on NLP and unsupervised learning. NLP is used to perform topic modeling, and Byte mLSTM, a different sequence-modeling approach, is implemented to address the clustering problem in PII detection. The performance of the proposed model is compared against existing hierarchical clustering in terms of silhouette and cohesion scores. The outcome indicates the effectiveness of the proposed system, which highlights significant PII attributes and has significant scope for real-time implementation. In contrast, existing techniques are too expensive to operate in real-time environments.

Keywords: PII; natural language processing; word2vec machine learning; PII detection; security
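
The abstract evaluates clusterings by silhouette score. As a rough illustration of what that metric measures, the sketch below computes the mean silhouette coefficient in plain Python. It is not the authors' implementation: the function name `silhouette_score`, the Euclidean distance, and the toy 2-D points (hypothetical stand-ins for text embeddings) are all assumptions made for this example.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    # Group point indices by their cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        # Average distance from point i to the members of a cluster,
        # excluding i itself when it belongs to that cluster.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # common convention: a singleton cluster contributes 0
        a = mean_dist(i, own)  # cohesion: distance to own cluster
        b = min(mean_dist(i, m) for l, m in clusters.items() if l != label)
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom  # in [-1, 1]; higher is better
    return total / len(points)


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]  # matches the visible grouping
bad_labels = [0, 1, 0, 1, 0, 1]   # mixes the two groups

print(silhouette_score(points, good_labels))
print(silhouette_score(points, bad_labels))
```

A labeling that matches the natural grouping should score close to 1, while one that mixes the groups scores much lower, which is the sense in which a higher silhouette score indicates better-separated clusters.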

Poornima Kulkarni and Cauvery N K, “Personally Identifiable Information (PII) Detection in the Unstructured Large Text Corpus using Natural Language Processing and Unsupervised Learning Technique,” International Journal of Advanced Computer Science and Applications (IJACSA), 12(9), 2021. http://dx.doi.org/10.14569/IJACSA.2021.0120957

@article{Kulkarni2021,
title = {Personally Identifiable Information (PII) Detection in the Unstructured Large Text Corpus using Natural Language Processing and Unsupervised Learning Technique},
journal = {International Journal of Advanced Computer Science and Applications},
doi = {10.14569/IJACSA.2021.0120957},
url = {http://dx.doi.org/10.14569/IJACSA.2021.0120957},
year = {2021},
publisher = {The Science and Information Organization},
volume = {12},
number = {9},
author = {Poornima Kulkarni and Cauvery N K}
}

Copyright Statement: This is an open access article licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, even commercially as long as the original work is properly cited.
