The Science and Information (SAI) Organization

Article Details

Copyright Statement: This is an open access article licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, even commercially, as long as the original work is properly cited.

An End-to-End Big Data Deduplication Framework based on Online Continuous Learning

Author 1: Widad Elouataoui
Author 2: Imane El Alaoui
Author 3: Saida El Mendili
Author 4: Youssef Gahi

Digital Object Identifier (DOI): 10.14569/IJACSA.2022.0130933

Article published in International Journal of Advanced Computer Science and Applications (IJACSA), Volume 13, Issue 9, 2022.

Abstract: While the benefits of big data are numerous, most collected data is of poor quality and therefore cannot be used effectively as is. One of the leading big data quality challenges is data duplication: gathered big data is usually messy and may contain duplicated records. The process of detecting and eliminating duplicated records is known as deduplication, entity resolution, or record linkage. Data deduplication has been widely discussed in the literature, and multiple deduplication approaches have been suggested. However, few efforts have addressed deduplication in the big data context. Moreover, existing big data deduplication approaches do not handle the degradation of the deduplication model's performance during serving. In addition, most current methods are limited to duplicate detection, which is only one part of the deduplication process. This paper therefore proposes an End-to-End Big Data Deduplication Framework based on a semi-supervised learning approach that outperforms existing big data deduplication approaches, with an F-score of 98.21%, a precision of 98.24%, and a recall of 96.48%. The suggested framework encompasses all data deduplication phases: data preprocessing and preparation, automated data labeling, duplicate detection, data cleaning, and auditing and monitoring. This last phase is based on an online continual learning strategy for big data deduplication that addresses the degradation of the deduplication model's performance during serving. The results show that the suggested continual learning strategy increased model accuracy by 1.16%. Furthermore, we apply the proposed framework to three different datasets and compare its performance against existing deduplication models. Finally, the results are discussed, conclusions are drawn, and future work directions are highlighted.

Keywords: Big data deduplication; online continual learning; big data; entity resolution; record linkage; duplicates detection
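
The abstract describes duplicate detection followed by a monitoring phase that keeps updating the model during serving. The sketch below illustrates that general idea only; it is not the authors' implementation. It assumes string-similarity features from Python's standard `difflib` and a hand-written online logistic regression as a stand-in for whatever model the paper uses. The names `pair_features` and `OnlineDuplicateClassifier`, the learning rate, and all records are hypothetical, and the pairs are hand-labeled rather than automatically labeled as in the framework.

```python
import math
from difflib import SequenceMatcher


def pair_features(a, b):
    """One similarity score in [0, 1] per field shared by two record dicts."""
    return [
        SequenceMatcher(None, str(a[k]).lower().strip(), str(b[k]).lower().strip()).ratio()
        for k in sorted(a)
    ]


class OnlineDuplicateClassifier:
    """Logistic regression updated one labeled pair at a time (SGD)."""

    def __init__(self, n_features, lr=0.5):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        # Gradient step on the log loss for a single example (y is 0 or 1).
        err = self.predict_proba(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err

    def is_duplicate(self, a, b, threshold=0.5):
        return self.predict_proba(pair_features(a, b)) >= threshold


# Hypothetical labeled pairs: (record_a, record_b, label), 1 = duplicate.
initial_pairs = [
    ({"name": "Jon Smith", "city": "New York"},
     {"name": "John Smith", "city": "new york"}, 1),
    ({"name": "Jon Smith", "city": "New York"},
     {"name": "Alice Wong", "city": "Paris"}, 0),
]
# Feedback that arrives later, while the model is already serving.
feedback_stream = [
    ({"name": "Maria Garcia", "city": "Madrid"},
     {"name": "Maria  Garcia", "city": "Madrid "}, 1),
    ({"name": "Maria Garcia", "city": "Madrid"},
     {"name": "Pedro Alvarez", "city": "Lisbon"}, 0),
]

model = OnlineDuplicateClassifier(n_features=2)

# Initial training: several passes over the small labeled set.
for _ in range(200):
    for a, b, y in initial_pairs:
        model.update(pair_features(a, b), y)

# Serving phase: each new labeled pair updates the model in place,
# without retraining from scratch.
for _ in range(50):
    for a, b, y in feedback_stream:
        model.update(pair_features(a, b), y)

held_out_dup = ({"name": "Katherine Jones", "city": "Boston"},
                {"name": "Katharine Jones", "city": "boston"})
held_out_distinct = ({"name": "Katherine Jones", "city": "Boston"},
                     {"name": "Omar Haddad", "city": "Cairo"})

print(model.is_duplicate(*held_out_dup))
print(model.is_duplicate(*held_out_distinct))
```

The per-example `update` call is what makes the learner online: feedback gathered during serving adjusts the weights incrementally, which is one simple way to counter a gradual drop in model quality. A big data setting would also need blocking or indexing to avoid comparing every pair of records, which this sketch omits.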

Widad Elouataoui, Imane El Alaoui, Saida El Mendili and Youssef Gahi, “An End-to-End Big Data Deduplication Framework based on Online Continuous Learning,” International Journal of Advanced Computer Science and Applications (IJACSA), 13(9), 2022. http://dx.doi.org/10.14569/IJACSA.2022.0130933

@article{Elouataoui2022,
title = {An End-to-End Big Data Deduplication Framework based on Online Continuous Learning},
journal = {International Journal of Advanced Computer Science and Applications},
doi = {10.14569/IJACSA.2022.0130933},
url = {http://dx.doi.org/10.14569/IJACSA.2022.0130933},
year = {2022},
publisher = {The Science and Information Organization},
volume = {13},
number = {9},
author = {Widad Elouataoui and Imane El Alaoui and Saida El Mendili and Youssef Gahi}
}

