Deep Learning based Anomaly Detection in Images: Insights, Challenges and Recommendations

Deep learning-based anomaly detection in images has recently become a popular research area with numerous applications worldwide. The main aim of anomaly detection (also known as outlier detection) is to identify data instances that deviate considerably from the majority of data instances. This paper offers a comprehensive analysis of previous works on deep learning-based anomaly detection in images, both in general and in the medical field specifically. Twenty studies were reviewed, and the literature selection methodology was defined in four phases: keyword filter, publish filter, year filter, and abstract filter. In this review, we highlight the differences among the included studies by considering the following factors: methodology, dataset, preprocessing, results, and limitations. In addition, we illustrate the various challenges and potential future directions relevant to anomaly detection in images.
Keywords: Anomaly detection; outlier detection; deep learning


I. INTRODUCTION
The primary goal of anomaly detection, also known as outlier detection, is to identify examples that deviate from what is typical or expected [1]. Anomaly detection in images has recently become a popular research area with applications in fields ranging from video surveillance to medicine [2] [3]. Anomalies arise for various reasons, such as data errors or noise, but sometimes indicate a new, previously unseen process. Thus, anomaly detection is a crucial task, especially in medical image processing.
Many researchers have employed deep learning to detect abnormalities in images, owing to the proliferation of deep neural networks and their unprecedented results across various applications. Deep learning can also handle complicated features, such as regions of interest, by examining every pixel in an image [4] [5].
In fact, deep learning-based anomaly detection has gained prominence and has been applied to various tasks, and these techniques are increasingly popular in the medical sector [3] [6][7][8][9]. One reason is that it copes with imbalanced data, which may otherwise bias a model towards the majority group (i.e., the negative cases). Since negative cases outnumber positive ones in medical images, we believe that anomaly detection is a better-suited technique than binary classification [9].
Several papers from different fields address deep learning-based anomaly detection. However, we believe the literature lacks reviews that state the gaps and limitations of existing work on this topic. Therefore, this review article collects and comprehensively analyzes recent works on deep learning-based anomaly detection in images, so that the community can readily understand the contributions and limitations of each study and overcome these limitations in future work.
This study aims to illustrate the state-of-the-art techniques for anomaly detection in images by reviewing recent studies that leverage deep learning. In our survey, we classify the studies into two categories: the general field and the medical field. This study also discusses several factors that make anomaly detection challenging, including the limited availability of labeled data and noise that resembles actual anomalies and is therefore difficult to distinguish from them.
The main contributions of this paper are as follows: (a) a comprehensive analysis of previous works on deep learning-based anomaly detection in images, both in general and in the medical field specifically, which considers methodology, dataset, preprocessing, findings, and limitations and outlines the differences between these studies; (b) an illustration of the various challenges and potential future directions relevant to anomaly detection in images.
The remainder of this article is organized as follows. Section II gives the background of this study. Section III defines the terminology needed to understand the rest of the article. Section IV discusses the literature selection methodology. Section V reviews recent works on deep learning-based anomaly detection. Section VI discusses observations and challenges, while Sections VII and VIII present the conclusion and future work.

II. BACKGROUND
This section provides the background needed to understand this article. We briefly explain the elements that form the context of this review (i.e., anomaly detection, deep learning, and automated medical image diagnosis).

A. Anomaly Detection
Anomaly detection, also known as outlier detection, is defined as the process of identifying data instances that deviate considerably from other data instances [4]. As shown in Fig. 1, "N1" and "N2" are regions containing the majority of observations and are therefore considered normal data regions, while the "O3" area and the "O1" and "O2" data points lie far from the bulk of the data and are therefore considered anomalies. Anomalies often occur due to data errors but sometimes indicate a new underlying process that was not previously known [5]. Anomaly detection plays an increasingly important role in different communities, including machine learning, computer vision, and data mining [4].

B. Deep Learning
In recent years, deep learning has developed rapidly, as demonstrated across a wide range of application areas. Deep learning is a sub-domain of machine learning that aims to achieve good performance and flexibility [4]. As R. Chalapathy et al. stated in [5], deep learning achieves better performance and flexibility than conventional machine learning by learning to represent data as a nested hierarchy of concepts within the layers of a neural network. As Fig. 2 shows, deep learning outperforms conventional machine learning approaches as the scale of the data increases [10].

C. Automated Medical Image Diagnosis
In medical image processing, automated diagnosis is the primary and most important task. Automated diagnosis is based on detecting abnormalities in the images [11]. Detecting abnormalities such as malignant tumors in medical images, including mammograms and CT scans, remains an open research problem that attracts considerable attention because of its applications in medical diagnosis [9].

III. TERMINOLOGY
The basic terms used in the anomaly detection field are as follows.

A. Deep Learning
Deep learning is "learning feature hierarchies with features from higher levels of the hierarchy formed by the composition of lower-level features" [12]. In other words, deep learning learns layers of features.

B. Anomaly Detection
Anomaly detection is the process of identifying data instances that deviate from what is normal or expected [1].

C. Semi-Supervised (One-Class Classification) Deep Anomaly Detection
Semi-supervised deep anomaly detection is defined as "a technique assumes that all training instances have only one class label" [5].

D. Unsupervised Deep Anomaly Detection
Unsupervised deep anomaly detection is "a technique that used automatic labeling of unlabeled data samples" [5].

E. Normal Data
Normal data are the majority of data instances (usually the negative cases in the medical field) [5] [9].

F. Anomalous/Abnormal data
Abnormal data are the data instances that deviate from the majority (usually the positive, i.e., diseased, cases in the medical field) [5] [9].

G. Anomaly Score
The anomaly score "describes the level of outlierness for each data point" [5].
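To make the notion of an anomaly score concrete, the following minimal sketch scores one-dimensional samples by their absolute z-score relative to normal training data. The z-score, the threshold of 3.0, and the sample values are illustrative assumptions and are not taken from any of the reviewed studies, which typically derive the score from a deep model (for example, a reconstruction error).

```python
from statistics import mean, pstdev


def anomaly_scores(train_normal, samples):
    """Score each sample by its absolute z-score relative to normal data."""
    mu = mean(train_normal)
    sigma = pstdev(train_normal) or 1.0  # guard against zero spread
    return [abs(x - mu) / sigma for x in samples]


# Hypothetical 1-D feature values (illustration only, not real data).
normal = [9.8, 10.1, 10.0, 9.9, 10.2]
scores = anomaly_scores(normal, [10.0, 10.3, 14.0])

# The threshold 3.0 is an assumed cut-off; in practice it is tuned on validation data.
flags = [s > 3.0 for s in scores]
print(flags)  # [False, False, True]
```

A higher score means a higher level of outlierness; only the last sample exceeds the assumed threshold.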

IV. LITERATURE SELECTION METHODOLOGY
To select the most important anomaly detection literature for this review, an existing selection methodology was adapted from [13]. This section describes the process for selecting literature (see Fig. 3).

A. Keywords Filtering Stage
We started by selecting related articles from the Google Scholar search engine, arXiv, and bioRxiv that contain at least one of the following keywords in the title: (1) anomaly detection, (2) anomaly detection in images, (3) anomaly detection in medical images, or (4) deep learning-based anomaly detection. This stage yielded 55 articles.

C. Year Filtering Stage
The literature selection methodology also focused on recent research by considering articles from the following years only: (1) 2020, (2) 2019, and (3)

D. Abstract Filtering Stage
The abstracts of the 28 articles from the previous stage were read in order to identify only the most relevant articles, namely those that specifically study deep learning-based anomaly detection in images, with a particular focus on the medical field. As a result, 20 articles were chosen from the anomaly detection literature.
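The staged selection described in this section can be sketched as a chain of filters that records how many articles survive each stage. This is only an illustrative sketch: the `Article` record, the function names, and the example entries are hypothetical; the publish filter is omitted because its criteria are not described here; and the allowed years are passed in by the caller (the example uses only 2019 and 2020, the two years that are listed above).

```python
from dataclasses import dataclass

# Title keywords listed in the keywords filtering stage.
KEYWORDS = (
    "anomaly detection",
    "anomaly detection in images",
    "anomaly detection in medical images",
    "deep learning-based anomaly detection",
)


@dataclass
class Article:
    title: str
    year: int
    abstract_relevant: bool  # outcome of the manual abstract reading


def keyword_filter(articles):
    """Keep articles whose title contains at least one keyword."""
    return [a for a in articles if any(k in a.title.lower() for k in KEYWORDS)]


def year_filter(articles, allowed_years):
    """Keep articles published in one of the allowed years."""
    return [a for a in articles if a.year in allowed_years]


def abstract_filter(articles):
    """Keep articles judged relevant after reading the abstract."""
    return [a for a in articles if a.abstract_relevant]


def select(articles, allowed_years):
    """Apply the stages in order and count the survivors of each stage."""
    stages = [
        ("keyword", keyword_filter),
        ("year", lambda xs: year_filter(xs, allowed_years)),
        ("abstract", abstract_filter),
    ]
    counts = {}
    for name, stage in stages:
        articles = stage(articles)
        counts[name] = len(articles)
    return articles, counts


# Hypothetical candidate list (illustration only).
candidates = [
    Article("Deep Anomaly Detection in Retinal Images", 2020, True),
    Article("Anomaly Detection for Network Traffic", 2019, False),
    Article("Anomaly Detection in Medical Images with GANs", 2015, True),
    Article("Image Segmentation with U-Net", 2020, True),
]
selected, counts = select(candidates, allowed_years={2019, 2020})
print(counts)  # {'keyword': 3, 'year': 2, 'abstract': 1}
```

Recording the count after each stage mirrors the way the review reports how many articles remain at each step.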

V. RECENT WORKS OF DEEP LEARNING-BASED ANOMALY DETECTION
This paper reviews twenty studies on deep learning-based anomaly detection in images, both in general and in the medical field specifically. Fig. 6 presents the percentage of articles for each field.

A. General Field
This section presents previous works on anomaly detection in the general field.
The authors of [14] proposed Deep Semi-Supervised Anomaly Detection (Deep SAD). Furthermore, they presented an information-theoretic framework for deep anomaly detection, which is based on minimizing the entropy of the latent distribution for normal data and maximizing the entropy of the latent distribution for anomalous data. The experiments were conducted on several public datasets, and the method was compared with previous methods. The results show that the method was on par with or outperformed the methods it was compared with. However, the authors did not address the difficulty of obtaining labeled anomalies.
The study in [15] presented a novel method called Iterative Training Set Refinement (ITSR). It relies on an adversarial autoencoder architecture to overcome the shortcomings of conventional autoencoders when anomalies are present in the training set. Two public datasets were used: MNIST and Fashion-MNIST. The results show that the method achieves better accuracy than traditional autoencoders and adversarial autoencoders. However, the authors did not evaluate their method on noisy images, which means that data preprocessing was not considered. They also did not compare their results with other works or state-of-the-art methods.
The research in [16] proposed a new framework and its instantiation, Deviation Networks (DevNet), which leverages a few labeled anomalies together with a prior probability to achieve end-to-end differentiable learning of anomaly scores. Nine publicly available real-world datasets from various critical fields were used, for example, fraud detection, disease detection, malicious URL detection, and intrusion detection. The experimental findings indicate that the approach is more effective than state-of-the-art competing methods. However, the authors did not examine the scarcity of labeled anomaly data in the real world, particularly in the medical field.
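The idea of learning anomaly scores from a few labeled anomalies and a prior can be illustrated with a deviation-style loss. The sketch below is an assumption-laden illustration rather than the implementation of [16]: it assumes a standard Gaussian prior over reference scores, 5000 reference samples, and a margin of 5.0, and the function name `deviation_loss` is hypothetical. Normal samples are pulled towards the mean of the reference scores, while labeled anomalies are pushed at least `margin` standard deviations above it.

```python
import random
from statistics import mean, pstdev


def deviation_loss(score, label, ref_scores, margin=5.0):
    """Deviation-style loss for one sample (label 0 = normal, 1 = anomaly)."""
    mu, sigma = mean(ref_scores), pstdev(ref_scores)
    dev = (score - mu) / sigma  # z-score of the sample's anomaly score
    if label == 0:
        return abs(dev)  # normal: stay close to the reference mean
    return max(0.0, margin - dev)  # anomaly: exceed the mean by the margin


# Reference scores drawn from an assumed standard Gaussian prior.
rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(5000)]

print(deviation_loss(0.0, 0, reference))   # small: a normal sample near the mean
print(deviation_loss(0.0, 1, reference))   # large: an anomaly scored like normal data
print(deviation_loss(10.0, 1, reference))  # 0.0: an anomaly scored far above the mean
```

In a real model, the score would be the output of a neural network and the loss would be averaged over a mini-batch and minimized by gradient descent.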
In contrast, the authors of [17] proposed an unsupervised model, the Deep Autoencoding Gaussian Mixture Model (DAGMM), for unsupervised anomaly detection. The experiments used four public benchmark datasets, and the results were compared with state-of-the-art anomaly detection techniques. The results indicate that DAGMM exceeds state-of-the-art anomaly detection methods with a 14% improvement in the standard F1 score. However, the authors did not test their method on noisy images to show how much noise affects the results.

B. Medical Field
This section presents previous works on anomaly detection in the medical field, grouped by application area.
1) Breast: The authors of [18] introduced a new measure for determining the effect of a particular sample on a task, which makes it possible to detect out-of-distribution samples. The measure was integrated into a simple convolutional autoencoder (CAE) model for the abnormality recognition task and evaluated on Breast Magnetic Resonance Imaging (MRI) and Breast Full-Field Digital Mammography (FFDM) datasets. Experimental results demonstrate that the method outperforms the compared methods, with an accuracy of 90.1% and 95.6% on the MRI and FFDM datasets, respectively. However, the experiments were conducted on relatively small datasets.
The authors of [19] proposed an architecture with two deep convolutional networks (R and M) for detecting irregular tissues in mammography images. They used three public datasets: the Mammographic Image Analysis Society (MIAS) and INbreast datasets for training, and the Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM) dataset for testing. The accuracy achieved was 76% and 86% on the MIAS and INbreast datasets, respectively. However, the datasets used are small. Moreover, the model does not process the whole image in one step.
The study in [9] designed an autoencoder based on a deep neural network to detect anomalies in medical images using one-class classification. The INbreast dataset was used, and the performance was 84%. This study also used a small dataset. Furthermore, the authors did not compare their results with other works or state-of-the-art methods.
2) Chest: For the chest area, a confidence-aware anomaly detection (CAAD) model was proposed in [20], which casts viral pneumonia screening against non-viral pneumonia and healthy controls as a one-class classification-based anomaly detection problem. The model consists of a feature extractor, an anomaly detection module, and a confidence prediction module. Four datasets were used: X-VIRAL, X-COVID, a public COVID-19 dataset, and a combination of the X-COVID and Open-COVID datasets. The results show an accuracy of 87.57%, 83.61%, 94.93%, and 84.43% on these datasets, respectively. The only limitation of this research is that the authors did not evaluate the model without data preprocessing to see whether the results differ substantially.
The study in [21] presented an abnormality detection method based on an autoencoder with uncertainty prediction. The method reconstructs the image together with a pixel-wise uncertainty prediction. Two public chest X-ray datasets were used: the RSNA Pneumonia Detection Challenge dataset and a pediatric chest X-ray dataset. The area under the curve (AUC) was 89% and 78% on the two datasets, respectively. The method does not include a data preprocessing step.
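Because several of the reviewed studies report the AUC, a short sketch of how it can be computed from anomaly scores may help. The function below uses the pairwise-ranking interpretation of the AUC (the probability that a randomly chosen abnormal sample receives a higher score than a randomly chosen normal one, with ties counted as one half). The function name and the example scores are illustrative assumptions, not data from any reviewed study.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (label 1 = abnormal)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one normal and one abnormal sample")
    # Count abnormal/normal pairs ranked correctly; ties count as half.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical anomaly scores, e.g. reconstruction errors of an autoencoder.
example_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print(example_auc)  # 0.75
```

Unlike accuracy, the AUC does not depend on a decision threshold, which is one reason it is commonly reported for anomaly detection.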
In [22], an end-to-end architecture for detecting abnormalities in chest X-rays using generative adversarial one-class learning was proposed. It is similar to generative adversarial networks (GANs) and consists of a U-Net autoencoder, a CNN discriminator, and an encoder. The experiments were conducted on the NIH Clinical Center Chest X-ray dataset, and the architecture achieved 80% accuracy in detecting lung opacities. However, the results were not compared with those of other algorithms.
3) Brain: Researchers have recently shown increased interest in Generative Adversarial Networks (GANs). Accordingly, the authors of [23] introduced an unsupervised anomaly detection Generative Adversarial Network (MADGAN) method based on multiple adjacent brain MRI slice reconstruction. This approach is capable of detecting various diseases at different stages on multi-sequence structural MRI. Two datasets were used. The first MRI dataset was extracted from the Open Access Series of Imaging Studies-3 (OASIS-3). The second, a brain metastasis and various disease MRI dataset, was collected by the authors (National Center for Global Health and Medicine, Tokyo, Japan). The results demonstrate that this method can detect anomalies at a very early stage with an AUC of 72.7% and at a late stage with an AUC of 89.4%. However, the results were not compared with those of other algorithms.
The authors of [24] defined and demonstrated a method that uses GANs trained on multi-modal magnetic resonance images (MRI) as a 3-channel input. Their model was used to detect tumours as anomalies. The dataset was obtained from The Cancer Imaging Archive. The resulting accuracy was observed to vary substantially with the size of the anomaly: the area under the receiver operating characteristic curve (AUROC) was greater than 75% for anomaly sizes greater than 4 cm². However, the dataset consists of only 20 patients, which is very small.
The authors of [25] proposed a semi-supervised anomaly detection model to detect brain tumor abnormalities. The model consists of four components: an encoder-decoder, a discriminator, a latent regularizer, and an auxiliary encoder. The model was first tested on two benchmark datasets, MNIST and CIFAR-10, for comparison with state-of-the-art methods. It was then applied to the HCP database and the BraTS dataset, using normal images from the HCP database as training data and the whole BraTS 2019 dataset as test data. The results were 93% and 79.7% on MNIST and CIFAR-10, respectively, and 99.4% on the BraTS dataset. The method does not include a data preprocessing step.

4) Eye: The authors of [26] proposed a novel P-Net for retinal image anomaly detection. The network architecture consists of three modules: a module that extracts structure from the original image, an image reconstruction module, and a module that extracts structure from the reconstructed image. Two datasets were used: the Retinal Edema Segmentation Challenge Dataset (RESC) and the Fundus Multi-disease Diagnosis Dataset (iSee). The results were 92.88% and 72.45% on the two datasets, respectively. The method does not include a data preprocessing step.
The study in [27] proposed a transfer-learning-based approach for unsupervised anomaly detection. The methodology uses a convolutional neural network as a feature extractor and the Isolation Forest anomaly detection method as the classifier. Two benchmark datasets (CIFAR-10 and SVHN) and two retinal fundus image datasets, Retinopathy of Prematurity (ROP) and Diabetic Retinopathy (DR), were used. The results were 88.2% and 55.4% on CIFAR-10 and SVHN, respectively, and 77% and 74.5% on ROP and DR, respectively. The authors did not evaluate the approach without data preprocessing to see whether the results differ substantially. Furthermore, the performance on medical images needs improvement.

5) Abdomen: In another study that used an unsupervised model, the authors of [28] considered the problem that organs other than the stomach appear in gastric X-ray examinations, which adds noise and can degrade classification performance. Therefore, they proposed a deep learning-based anomaly detection model inspired by DAGMM and formulated as an organ classification task. The experiment was conducted on one dataset of gastric X-ray images, and the model was compared with other approaches. The results show that the model outperforms the comparison models, with a sensitivity of 95.6%. The limitation of this paper is that the dataset contains a small number of stomach images with barium leaks, which are not useful for gastritis detection.

6) Cardiac: In another medical application area, the authors of [29] proposed a decision-boundary-based anomaly detection model for ECG data using an improved AnoGAN. The proposed model achieves 94.75% on the MIT-BIH Arrhythmia ECG dataset, which is the best performance among the many models compared. However, the authors did not test the model without their data preprocessing to show how much it contributes.

7) Musculoskeletal: The study in [30] presented a preprocessing pipeline and surveyed unsupervised deep learning methods for an anomaly detection task. The authors compared these methods with each other, with and without the preprocessing pipeline, to determine which algorithm is best for this task and to show the effect of the preprocessing pipeline on performance. They worked on a subset of the MURA dataset consisting of X-ray images of hands. The results show that the best model overall is the GAN-based α-GAN approach with 60.7%, and the best autoencoder-based model is the convolutional autoencoder (CAE) with 57%. However, the experiments were conducted on a small dataset because the full MURA dataset was not used.

VI. OBSERVATIONS AND CHALLENGES

A. Overview
Many studies have worked on anomaly detection algorithms. A summary of the related studies on deep learning-based anomaly detection in images is presented in Tables I and II for the general and medical fields, respectively. After reviewing the studies, we made the following observations. First, most researchers use deep learning rather than conventional machine learning because deep learning performs better and can handle the complexity of images and large datasets efficiently. Second, most researchers, in both the general and medical fields, have used unsupervised [9] [17][18][19][20] [23][24] [26][27][28][29][30] or semi-supervised [14][15][16] [21][22] [25] learning methods for anomaly detection. Third, most studies do not leverage a limited number of labeled anomalies as prior knowledge. Using this technique in future work is therefore a good way to avoid mistaking data noise or uninteresting data for anomalies due to the lack of prior knowledge about the anomalies of interest, and to increase the model's performance, as shown in [16]. Fourth, some studies used a small dataset [9] [18][19] [24] [30], so large datasets are rarely used in anomaly detection tasks. Fifth, data preprocessing is essential for obtaining good performance, as shown in [30]; some researchers considered it [18] [27][28][29][30], while others did not. Sixth and finally, most studies compare their method with many different algorithms and report the evaluation metrics of each, which is an important aspect of evaluating the effectiveness of a model.

B. Challenges
Numerous factors make anomaly detection very challenging. First, the class imbalance between normal and abnormal data must be handled. Second, labeled data are rarely available. Third, the data often contain noise that resembles the actual anomalies and is thus difficult to distinguish from them [33]. Fourth, the exact notion of an anomaly varies across application areas. For example, in the medical field, a small deviation from normal, such as a fluctuation in body temperature, might be an anomaly, whereas a similar deviation in the value of a stock might be normal in the stock market domain [33]. Therefore, it is not straightforward to adapt a method developed in one field to another.
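The class imbalance challenge can be illustrated with a small worked example. The sketch below assumes a hypothetical test set of 95 normal and 5 abnormal images and a trivial model that labels every image as normal; the numbers are illustrative and not taken from any reviewed study. The model reaches high accuracy while detecting no anomalies at all, which is why sensitivity or AUC is usually reported alongside accuracy.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def sensitivity(y_true, y_pred):
    """Fraction of abnormal samples (label 1) that are detected."""
    positives = sum(t == 1 for t in y_true)
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    return true_pos / positives


# Hypothetical imbalanced test set: 95 normal (0) and 5 abnormal (1) images.
y_true = [0] * 95 + [1] * 5
# A trivial model that predicts "normal" for every image.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
sens = sensitivity(y_true, y_pred)
print(acc, sens)  # 0.95 0.0
```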