Deep Learning based Neck Models for Object Detection: A Review and a Benchmarking Study

Artificial intelligence is the science of enabling computers to act without being explicitly programmed. Computer vision, one of its most innovative fields, deals with how computers gain understanding from images and videos. In recent decades, computer vision has been applied in many fields, such as self-driving cars, efficient information retrieval, effective surveillance, and a better understanding of human behaviour. Based on deep neural networks, object detection is growing rapidly and pushing the limits of detection accuracy and speed. Object detection aims to locate each object instance in an image or a video sequence and assign a class to it. Object detectors usually comprise a backbone network for feature extraction, a neck model for feature aggregation, and a head for prediction. Neck models, which are the subject of this paper, are neural networks that fuse high-level and low-level features and are known for their efficiency in object detection. The aim of this study is to present a review of neck models, followed by a benchmark that researchers and scientists can use as a guideline for their work.

Keywords: Object detection; deep learning; computer vision; neck models; feature aggregation; feature fusion


I. INTRODUCTION
Object detection is often also called image detection, object identification, or object recognition; these terms are used synonymously. It is a computer vision method for locating instances of objects in an image or video sequence. Object detection algorithms typically rely on machine learning or deep learning techniques to produce meaningful results. When humans look at images or videos, they can locate and recognize objects of interest easily. The goal of object detection is to mimic this ability using a computer. With recent advances in deep learning-based computer vision models, object detection use cases are spreading more than ever before. A wide range of applications has been implemented, for instance self-driving cars, object tracking, anomaly detection, and video surveillance.
Object detection techniques can be divided into two main categories: deep learning-based techniques and machine learning-based techniques. Deep learning-based techniques can in turn be separated into two approaches: one-stage detectors and two-stage detectors. A deep learning-based object detector is a chain of models: starting from the input, a backbone extracts features, a neck model fuses them, and finally a head model (class/box network) makes the predictions.
The neck of an object detector refers to the additional layers between the backbone [1] and the head. Their role is to collect feature maps from different stages. Neck models are composed of several top-down paths and several bottom-up paths. The idea behind this feature aggregation is to let low-level features interact more directly with high-level features by mixing information from the high-level feature maps into the low-level ones. Because the distance between the two feature maps in the backbone is large, the neck achieves aggregation and feature interaction across many layers. Several methods can be used in this part, for example PAN [2] or FPN [3] (see Fig. 1).
The head is the last model of the object detector. It predicts the bounding boxes and classes of objects, and can perform either dense prediction, as in one-stage detectors such as YOLO [4], SSD [5], and CenterNet [6], or sparse prediction, as in two-stage detectors such as Fast R-CNN [7], Faster R-CNN [8], and Mask R-CNN [9] (see Fig. 1). On the one hand, one-stage detectors have high inference speeds: they predict bounding boxes in a single step without using region proposals. On the other hand, two-stage detectors have high localization and recognition accuracy. First, they use a Region Proposal Network to generate regions of interest; second, they pass the region proposals on for object classification and bounding-box regression.
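The backbone, neck, and head arrangement described above can be sketched as a simple chain of functions. This is a minimal illustration of the data flow only, not an implementation of any detector discussed in this paper: the names `detect`, `toy_backbone`, `toy_neck`, and `toy_head` are hypothetical, and the toy stages stand in for real networks.

```python
def detect(image, backbone, neck, head):
    """Run the three stages in order: backbone -> neck -> head."""
    multi_scale_features = backbone(image)        # one feature map per backbone stage
    fused_features = neck(multi_scale_features)   # feature maps after aggregation
    return head(fused_features)                   # predictions made on the fused maps


# Hypothetical toy stages; they only illustrate how data moves between parts.
def toy_backbone(image):
    """Pretend each stage halves the resolution by keeping every other pixel."""
    stages = []
    current = image
    for _ in range(3):
        current = [row[::2] for row in current[::2]]
        stages.append(current)
    return stages


def toy_neck(features):
    """Identity neck: passes the feature maps through without fusion."""
    return features


def toy_head(features):
    """Report the spatial size of each level the head would predict on."""
    return [(len(fmap), len(fmap[0])) for fmap in features]


image = [[0.0] * 16 for _ in range(16)]  # a 16 x 16 single-channel "image"
level_sizes = detect(image, toy_backbone, toy_neck, toy_head)
print(level_sizes)  # -> [(8, 8), (4, 4), (2, 2)]
```

Replacing `toy_neck` with a fusion function changes what the head sees without touching the backbone or the head, which is why necks can be compared while the other two parts are held fixed.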
We hope that our benchmarking study provides a timely comparison of neck models for object detection, helping practitioners and researchers to advance research on object detection models. The rest of our study is organized as follows. In Section 2, we discuss existing related work on feature aggregation. In Section 3, we list the neck networks used for feature fusion in object detection and discuss their architectures by category. In Section 4, our comparative study is presented. In Section 5, we highlight the main results, and Section 6 covers the discussion. Finally, in Section 7, we conclude and discuss future directions.

II. RELATED WORK
Many research works have been carried out to develop and improve object detection applications and systems, drawing on a large number of methodologies from deep learning, machine learning, and other areas. Among them are feature aggregation methods, which connect low-level and high-level features for better object recognition in images and video sequences. Feature aggregation is widely used in action recognition [10], [11], [12], [13], [14] and video description [15], [16]. On the one hand, most of these methods use recurrent neural networks (RNNs) to aggregate features from consecutive frames. On the other hand, exhaustive temporal-spatial convolution is used to extract temporal-spatial features. U-Net [17] was proposed to concatenate features from low level to high level for medical image segmentation, and it achieved great success in that field. To obtain strong features for object detection, the Feature Pyramid Network (FPN) aggregates the features of the bottom-up pathway, transformed by lateral convolutions, with those of the top-down pathway through a simple sum operation. Building on Feature Pyramid Networks, several extensions [18], [19], [20], [2] define new connectivity options between scales. Attention-based models have also proven their efficiency in several deep learning applications [21], [22], [23], [24], [25], [26]. Self-attention models work by computing and applying a context-dependent encoding summarized along one dimension of the feature. All the works cited above aggregate and fuse features via element-wise concatenation or summation.
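The two fusion operations mentioned above differ in how they affect the number of channels. The following sketch illustrates this under simplifying assumptions: a feature is a plain list of channels, each channel is a flat list of values, and concatenation is taken along the channel dimension. The function names `fuse_by_sum` and `fuse_by_concat` are hypothetical.

```python
def fuse_by_sum(a, b):
    """Element-wise summation: both inputs need the same shape, channel count is unchanged."""
    if len(a) != len(b):
        raise ValueError("summation needs the same number of channels")
    return [[x + y for x, y in zip(ch_a, ch_b)] for ch_a, ch_b in zip(a, b)]


def fuse_by_concat(a, b):
    """Concatenation along the channel dimension: channel counts add up."""
    return a + b


# Two features with 2 channels and 2 spatial positions each (illustrative values).
low_level = [[1.0, 2.0], [3.0, 4.0]]
high_level = [[10.0, 20.0], [30.0, 40.0]]

summed = fuse_by_sum(low_level, high_level)
stacked = fuse_by_concat(low_level, high_level)

print(len(summed), summed)  # -> 2 [[11.0, 22.0], [33.0, 44.0]]
print(len(stacked))         # -> 4
```

Summation keeps the fused feature the same size as its inputs, whereas concatenation preserves both inputs intact at the cost of more channels for the following layer to process.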

III. BACKGROUND
The focus of this work is the object detector neck, the part located between the backbone and the head, as it has developed since the appearance of Feature Pyramid Networks. These techniques are useful for many reasons.

1) Aggregation network models (FPN):
FPN [3] is a top-down architecture with lateral connections; it is used to build high-level semantic feature maps at all scales (see Fig. 2).
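The top-down pathway with lateral connections can be sketched as follows. This is a simplified illustration under stated assumptions, not the full FPN of [3]: feature maps have a single channel, each lateral connection is reduced to a scalar weight (a stand-in for a 1x1 convolution), upsampling is nearest-neighbour by a factor of 2, and the smoothing convolution applied after merging in the original design is omitted. The names `upsample2x`, `add_maps`, and `fpn_top_down` are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour upsampling by a factor of 2 in both dimensions."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add_maps(a, b):
    """Element-wise sum of two maps with the same spatial size."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def fpn_top_down(backbone_maps, lateral_weights):
    """Fuse backbone maps ordered fine-to-coarse, each half the size of the previous.

    Returns the fused maps in the same fine-to-coarse order.
    """
    # Lateral connections: a scalar weight per level stands in for a 1x1 convolution.
    laterals = [
        [[weight * value for value in row] for row in fmap]
        for fmap, weight in zip(backbone_maps, lateral_weights)
    ]
    outputs = [laterals[-1]]  # the coarsest level starts the top-down path
    for lateral in reversed(laterals[:-1]):
        # Upsample the coarser fused map and add the lateral map of the finer level.
        outputs.insert(0, add_maps(lateral, upsample2x(outputs[0])))
    return outputs


# Illustrative inputs: three levels of sizes 4x4, 2x2 and 1x1, filled with ones.
c3 = [[1.0] * 4 for _ in range(4)]
c4 = [[1.0] * 2 for _ in range(2)]
c5 = [[1.0]]
p3, p4, p5 = fpn_top_down([c3, c4, c5], [1.0, 1.0, 1.0])
print(p5)  # -> [[1.0]]
print(p4)  # -> [[2.0, 2.0], [2.0, 2.0]]
```

In this toy run the finest output `p3` is a 4x4 map of 3.0, showing that information from both coarser levels has reached the highest-resolution map.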

4) Bi-directional feature pyramid network (BiFPN):
BiFPN [27] is a type of feature pyramid network that allows fast and easy multi-scale feature fusion. BiFPN builds on the other feature fusion models. It enables information to flow in both the top-down and bottom-up directions, while using efficient and regular connections. This network improves the connections by removing some nodes and treats each bidirectional path as one feature network layer (Fig. 5). Based on the architectures above, PANet performs better than FPN and NAS-FPN, but its computational cost is higher.
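This section does not detail how a BiFPN node combines its inputs. The BiFPN paper [27] describes a weighted scheme called fast normalized fusion, in which each input receives a learnable non-negative weight and the weights are normalized by their sum plus a small constant. The sketch below illustrates that formula on single-channel maps of equal size; the function name `fast_normalized_fusion`, the input values, and the fixed weights are assumptions for illustration (in a real network the weights are learned and the inputs are first resized to a common resolution).

```python
def fast_normalized_fusion(inputs, weights, eps=1e-4):
    """Fuse same-sized maps as sum_i(w_i * I_i) / (eps + sum_j(w_j)), with w_i >= 0."""
    clipped = [max(0.0, w) for w in weights]  # ReLU keeps every weight non-negative
    total = eps + sum(clipped)
    height, width = len(inputs[0]), len(inputs[0][0])
    return [
        [
            sum(w * fmap[r][c] for w, fmap in zip(clipped, inputs)) / total
            for c in range(width)
        ]
        for r in range(height)
    ]


# Illustrative inputs: two 1 x 2 maps, the second weighted three times as much.
top_down_input = [[2.0, 4.0]]
same_level_input = [[6.0, 8.0]]
fused = fast_normalized_fusion([top_down_input, same_level_input], [1.0, 3.0])
print(fused)  # values close to [[5.0, 7.0]]

# A negative weight is clipped to zero, so only the second input contributes.
clipped_case = fast_normalized_fusion([top_down_input, same_level_input], [-1.0, 2.0])
print(clipped_case)  # values close to [[6.0, 8.0]]
```

Normalizing by a plain sum rather than a softmax keeps each fused value a weighted average of the inputs while avoiding the cost of exponentials.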

5) Fully-connected FPN:
In the fully-connected FPN, the computation is the most complex, since all scales are linked by the most complete set of connections (see Fig. 6).

Table I lists the deep learning models used for the object detection task on the COCO dataset. It gives the models used to predict classes and bounding boxes. The backbone column gives the backbone used for feature extraction, where the associated number refers to the number of layers, and the neck column gives the feature aggregation network used.

In this part, we discuss the performance of the different methods cited in Table I.

1) Libra R-CNN:
We have compared Libra R-CNN [18] with different backbones. This comparison reveals that changing the backbone while keeping a solid feature aggregation model changes the performance. Libra R-CNN with ResNeXt-101 as a backbone is at the top of the range. Of the two remaining models, based on ResNet-50 and ResNet-101 backbones, Libra R-CNN with ResNet-101 achieves the higher performance (see Fig. 8).

2) Faster R-CNN:
Faster R-CNN [8] with ResNeXt-101-64x4d as a backbone and AugFPN as a feature aggregation model leads the performance in this category. With ResNet-50 fixed as the backbone and the feature aggregation model varied, the model based on AdaFPN achieves the highest performance. Moreover, with AugFPN fixed and the ResNeXt-101 variant varied, the best performance is achieved by ResNeXt-101-64x4d (see Fig. 9).

3) FCOS:
The highest performance was obtained with FCOS [28] as the head, ResNeXt-101 as the backbone, and FPN as the feature aggregation model. When the backbone is fixed to ResNet-50 and the feature aggregation model is varied among FPN, AdaFPN, and AugFPN, AdaFPN achieves the best performance in this category, followed by FPN and finally AugFPN (see Fig. 10).

4) Mask R-CNN:
Among the Mask R-CNN [9] models, built on a variety of backbones and necks from our category, the combination of ResNet-101 and FPN leads the performance, followed by ResNeXt-101 and FPN. With ResNet-101 fixed and the feature aggregation model varied, the highest performance is achieved by AugFPN, then FPN, and finally A2FPN. With ResNet-50 as the backbone and A2FPN or AugFPN as the feature aggregation model, AugFPN attains the better performance (see Fig. 11).

6) Cascade R-CNN:
The best Cascade R-CNN [29] performance was obtained by combining ResNet-101 and AC-FPN. The combination of ResNet-101 as a backbone with an FPN neck achieved lower performance (see Fig. 13).

8) Six Top average precision:
After extracting the six best models in terms of average precision, we compare the methods that achieve the top average precision. Based on our spider chart, CenterNet2 achieves the best performance. This method is based on Res2Net-101-DCN as a backbone and BiFPN as a feature aggregation model. The second rank goes to DetectoRS, based on ResNeXt-101-DCN as a backbone and RFP as a feature aggregation model (see Fig. 15).

VI. DISCUSSION
In this paper, we have systematically described the importance of object detection components, covering the deep learning methodologies used in object detection, including two-stage detectors and one-stage detectors.
First, we presented object detection methodologies, which are categorized into traditional methods and deep learning-based methods. Second, we discussed the main arrangement of deep learning-based object detectors: a backbone, usually pretrained, used to extract features; a feature aggregation model, called the neck, for merging high-level and low-level features; and finally the head, used for prediction.
Based on our comparative study, we notice that CenterNet2 with Res2Net-101-DCN as a backbone and BiFPN as a feature fusion model leads the performance, since it ranks first on all criteria.
DetectoRS, with ResNeXt-101-DCN as a backbone and RFP as a feature fusion model, reaches the second-highest score. HTC takes third position, with high performance based on ResNeXt-101 as a backbone and FPN. We also notice that there is no intersection between the compared algorithms: each algorithm keeps its position across all criteria. This comparison was made on the basis of a set of criteria. The score of each evaluated method was calculated using the Weight Score Model. The resulting scores not only helped us determine an overall ranking, but also showed the internal strengths and weaknesses of each method with respect to each criterion. This comparison also revealed the importance of benchmarking in order to obtain a clear global view of how to build efficient, high-performance models. On the one hand, this review and comparison study shows that the components of deep learning-based object detectors, namely the backbone, neck, and head, strongly affect performance. On the other hand, using more layers generally yields higher performance.
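The text names the Weight Score Model but does not give its formula, criteria, or weights. The sketch below assumes the common weighted-sum form, in which each method's score is the sum of its per-criterion scores multiplied by normalized criterion weights. The criteria names, weights, model names, and values are hypothetical placeholders, not results from Table I or the figures.

```python
def weighted_score(criteria_scores, weights):
    """Weighted sum of per-criterion scores; weights are normalized to sum to 1."""
    total_weight = sum(weights.values())
    return sum(criteria_scores[name] * w / total_weight for name, w in weights.items())


# Hypothetical criteria and weights (assumed; the paper does not list them here).
weights = {"AP": 2.0, "AP50": 1.0, "AP75": 1.0}

# Hypothetical placeholder scores, NOT measurements from this study.
candidates = {
    "model_a": {"AP": 0.50, "AP50": 0.70, "AP75": 0.55},
    "model_b": {"AP": 0.45, "AP50": 0.72, "AP75": 0.48},
}

scores = {name: weighted_score(values, weights) for name, values in candidates.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # -> ['model_a', 'model_b']
```

Keeping the per-criterion values alongside the aggregate score is what allows a ranking of this kind to expose the strengths and weaknesses of each method on individual criteria.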

VII. CONCLUSION
From this study, we note that many scientists and researchers from diverse backgrounds are working day after day in the object detection field, owing to its great importance. New models appear every month with the growth of deep learning.
This comparison can serve as a support, giving researchers a scientific comparison of different object detection methodologies and their main models, in order to build high-performance models.
A comparison of the necks used for feature aggregation between high-level and low-level features has been presented. We have focused on presenting different necks and analysing the performance of the overall models that use them.
Future work will focus on implementing some of the deep learning-based object detection models discussed here. We aim to implement and test them, and to analyze the results.