D2-Net: Dilated Contextual Transformer and Depth-wise Separable Deconvolution for Remote Sensing Imagery Detection

Abstract: Remote sensing object detection faces challenges from arbitrary orientations, complex backgrounds, dense distributions, and large aspect ratios. To address these issues, this paper introduces a novel method called D2-Net, which incorporates a transformer structure into a convolutional neural network. First, a new feature extraction module, the dilated contextual transformer block, is designed to minimize the loss of object information caused by complex backgrounds and dense targets. In addition, an efficient up-sampling approach based on depth-wise separable deconvolution is developed to recover lost feature information effectively. Finally, the circular smooth label is incorporated to compute the angular loss and complete the rotated detection of remote sensing images. Experimental evaluations are conducted on the DOTA and HRSC2016 datasets. On the DOTA dataset, the proposed method achieves 79.20% and 78.00% accuracy in horizontal and rotated object detection, respectively; on the HRSC2016 dataset, it achieves 94.00% accuracy in rotated detection. The proposed model shows a significant performance improvement over the compared models on these datasets, which verifies the effectiveness of the proposed approach.


I. INTRODUCTION
Due to advances in computer processing power, object detection has developed rapidly over the past decade. This task is typically accomplished with single-stage detectors, typified by the YOLO models [1], [2], [3], [4], [5], and two-stage models, exemplified by the RCNN series [6], [7], [8], [9]. Despite significant advances in generic object detection, detection in remote sensing images (RSIs) faces numerous challenges due to characteristics such as substantial variations in scale, crowded and small targets, arbitrary orientations, and large aspect ratios [10]. Therefore, detection using oriented bounding boxes (OBBs), which can handle object rotation, has become critical in remote sensing applications. Existing rotated object detection models are often constructed with pure convolutional neural networks (CNNs) or CNN-transformer hybrid structures. The former category has produced much representative work. Pixels-IoU Loss improves performance for complex backgrounds and large aspect ratios [11] but increases training time. The rotational region convolutional neural network (R2CNN) introduces joint prediction of axis-aligned bounding boxes and inclined minimum-area boxes to complete text detection in any direction [12]. A joint image cascade network (ICN) and feature pyramid network (FPN) can capture semantic features at multiple scales [13]. Zhu et al. proposed adaptive period embedding (APE), which represents oriented targets in a novel way, together with a length-independent IoU (LIIoU) suited to long targets [14]. Kim et al. developed TricubeNet, which locates oriented targets according to visual cues such as heat maps rather than oriented box offset regression [15].
Although CNNs have achieved impressive performance, they have difficulty capturing long-range dependencies, which limits their performance in remote sensing detection. In contrast, the structure of the transformer allows it to compensate well for this shortcoming of CNNs. Many hybrid CNN-transformer networks have achieved satisfactory results [16], [17], [18], [19], [20]. The RoI transformer applies spatial transformations to Regions of Interest (RoIs) and learns the transformation parameters using OBB annotations as supervision, which results in fewer mismatches during detection [16]. To address the boundary loss and spatial receptive field issues in RSIs, Dai et al. developed a rotating object detection transformer-based model (RODFormer) [17]. Another improved detector, CLT-Det, leverages correlation learning and a transformer to tackle the problem of large-scale variation and dense targets [18]. TransConvNet uses a self-attention block and a CNN to aggregate broad and specific details, offsetting the CNN's lack of rotational invariance [19]. Li et al. proposed an adaptive points learning method that effectively obtains geometric information for instances of arbitrary orientations [20].
The above suggests that incorporating a transformer module into a CNN can help overcome the model's difficulty in global feature modeling. Recent research also shows that simple hybrid networks can match the performance of many excellent complex models [21]. Therefore, this paper presents the dilated contextual transformer block (DCoT), combined with the efficient layer aggregation network (ELAN) in YOLOv7 [5], to improve the model's feature extraction capability. DCoT provides more feature information with a larger receptive field, allowing shallow location information to combine effectively with deep semantic information, which improves detection in complex backgrounds and among dense objects. Second, a depth-wise separable deconvolution (DS-DeConv) module is proposed to enable the model to generate more diverse feature information during up-sampling, thereby improving its ability to detect small and dense objects. Finally, the Circular Smooth Label (CSL) [22] is integrated into the baseline YOLOv7 [5] to complete rotated detection without being affected by boundary discontinuities. Extensive experiments were conducted on the DOTA v1.0 [10] and HRSC2016 [23] datasets to validate the efficacy of the proposed method. The experimental results demonstrate that the proposed model enhances detection in RSIs. Moreover, it achieves real-time detection with a slight reduction in the number of parameters, striking a balance between accuracy and speed.
The main contributions of this paper can be summarized as follows. First, we use DCoT to improve the model's ability to obtain contextual information, which enhances the detection of dense targets and of targets in complex backgrounds in RSIs. Second, we use DS-DeConv for up-sampling, which effectively preserves detailed feature information and enhances the model's ability to detect small objects. Finally, CSL is integrated into YOLOv7 to complete the rotated detection of multi-directional objects in RSIs. The proposed model outperforms the compared models in detection.
The remaining sections are structured as follows. Section II reviews the related work, including pure CNN and CNN-transformer hybrid detection models. Section III provides a detailed description of the proposed methods integrated into D2-Net. Section IV presents the experimental details and analyzes the experimental results. Finally, Section V presents the conclusion.

II. RELATED WORK

A. Pure CNN Detection Models
Depending on whether they generate region proposals, detection models fall into two types: single-stage detection methods and two-stage detection methods.
Single-stage detection methods directly predict object class and location without region proposals, resulting in faster inference and lower computational complexity than two-stage models. Redmon et al. proposed the first generation of YOLO [1], which opened the era of real-time object detection. This model views object detection as a regression task: it divides the image into multiple grid cells and detects the presence of an object by determining whether the object's center point falls within a particular grid cell. Inevitably, it struggles with dense, small, and large-aspect-ratio targets, issues that inspired other researchers to make further progress. To improve the accuracy of small target detection, SSD [24] feeds multiple features extracted from different layers of the feature extraction model to the object prediction module. It also simplifies the training process for targets with different shapes by assigning different scales and aspect ratios to the prior bounding boxes associated with each grid cell. The method uses convolutional layers instead of fully connected layers and produces results comparable to those of contemporaneous two-stage detection models. More recently, an enhanced SSD [25] introduces interactive multiscale attention to acquire a more effective feature representation capability. RetinaNet [26] incorporates focal loss and effectively addresses the class imbalance problem, resulting in high speed and accuracy.
Two-stage detectors have also gained significant attention due to their remarkable accuracy and robustness. RCNN [6] treats the detection task as a classification problem. In the first stage, it extracts region proposals from each image; it then predicts the targets' categories after computing CNN features. FPN [27] regards layers with consistent feature map sizes as a stage and achieves the top-down integration of multi-scale feature maps through successive stages. It distributes features based on object scale, merging deep-level semantic information with shallow-level fine-grained information to detect more accurately. Mask R-CNN [9] introduces RoI alignment to mitigate the data loss caused by feature quantization during the RoI pooling process.

B. Transformer Detection Models
Since transformers were introduced to computer vision, many distinctive models have emerged. Vision transformers divide the image into multiple patches, add positional embeddings to them, and then feed the feature information into the head for detection [28]. This allows the model to be independent of image size. DINO improves DETR-like models in terms of performance and efficiency by using a contrastive denoising training method, a mixed query selection method for anchor initialization, and a look-forward-twice scheme for box prediction [29]. BiFormer proposes a novel dynamic sparse attention via bi-level routing for more flexible computational allocation and content awareness, enabling dynamic query-aware sparsity [30].

C. CNN-Transformer Hybrid Detection Models
The emergence of the transformer structure compensates for the shortcomings of the pure CNN structure in obtaining long-range dependencies and contextual information, leading to numerous transformer-related models. However, pure transformer models have high memory consumption and complexity. Therefore, more models fuse the transformer module with a CNN by insertion or replacement to achieve a balance. RoI transformer [16] conducts spatial transformations on RoIs, learning the transformation parameters under the supervision of OBB annotations, which helps with densely packed RSI targets and reduces RoI-target mismatches. RODFormer [17] addresses boundary loss and the lack of spatial receptive field in RSI detection via a structured transformer model. CLT-Det [18] presents a correlation learning detector to handle large-scale variation and dense targets. TransConvNet [19] merges a self-attention block and a CNN, aggregating broad and specific information to compensate for the CNN's deficiency in rotational invariance. Li et al. proposed a robust adaptive points learning methodology to extract the geometric information of instances of arbitrary orientations [20].
To summarize, the combination of transformer and CNN can effectively overcome the limitation of CNN structures in capturing features at varying scales and improve the accuracy and robustness of object detection.

III. METHODS
In this section, we begin with a short introduction to the overall architecture of the proposed D2-Net, which takes YOLOv7 [5] as the baseline model. Next, we present a detailed description of the DCoT block and the depth-wise separable deconvolution. Finally, we briefly discuss the CSL [22], which is integrated into our model to accomplish rotated detection. Fig. 1 and Fig. 2 show the overall structure and the detailed block composition, respectively.

A. The D2-Net Structure
As depicted in Fig. 1, the backbone network extracts feature maps $C_i$, which are then sent to the neck network, where $i = 3, 4, 5$ denotes the feature level; $C_i$ has a stride of $2^i$, so its resolution is $1/2^i$ of the input image size $W \times H$. The neck network consists of two modules. The first is the FPN [25] architecture, which propagates semantic features from the deeper, lower-resolution levels to the shallower, higher-resolution levels. The second is the PAFPN [31] module, which uses a bottom-up feature merging path to transfer location details to feature maps at deeper layers, compensating for the loss of fine-grained information caused by resolution reduction. Furthermore, depth-wise separable deconvolution is adopted as the up-sampling method, and the improved ELAN module is adopted to improve the network's ability to capture contextual information. Feature maps at different scales, containing detailed semantic and rich localization information, are output to the RepConv block. Finally, the head network with CSL predicts object categories and positions, treating angle prediction as a classification problem.
In our method, after being processed by the improved neck network, the output feature representations at various resolutions achieve a balance between deep semantic information and shallow spatial details, leading to improved detection performance.
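The stride relationship between the feature levels can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `feature_map_sizes` and the 1024 × 1024 input size are assumptions chosen for the example, not part of the D2-Net implementation.

```python
def feature_map_sizes(width, height, levels=(3, 4, 5)):
    """Return {level: (w, h)} for feature maps C_i with stride 2**i."""
    sizes = {}
    for i in levels:
        stride = 2 ** i  # C_i is down-sampled by a factor of 2^i
        sizes[i] = (width // stride, height // stride)
    return sizes


if __name__ == "__main__":
    # Assumed input size of 1024 x 1024 (the crop size used for DOTA).
    for level, (w, h) in feature_map_sizes(1024, 1024).items():
        print(f"C{level}: stride {2 ** level}, size {w} x {h}")
```

Under these assumptions, the three levels have spatial sizes 128 × 128, 64 × 64, and 32 × 32, which is why the neck must merge features across resolutions to serve both small and large objects.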

B. Contextual Transformer Block with Dilated Convolution
Drawing inspiration from the self-attention mechanism in transformer models, numerous scholars have investigated the effectiveness of hybrid networks combining CNNs and transformers in computer vision tasks [16], [17], [18], [19]. As existing research shows, a simple fusion of CNNs and transformers lets object detection models attend to more useful features, which improves their performance. Therefore, hybrid networks that include a transformer module hold good prospects for RSI detection.
Traditional self-attention modules use input features from various spatial positions to process the input data. However, these modules learn each pairwise query-key relation independently over isolated query-key pairs, without considering the contextual information among them. The CoT [32] architecture can integrate abundant contextual information and contributes significantly to the visual representation of 2D images. Nevertheless, the standard convolution operation loses much localization information during feature processing. Therefore, we replace it with dilated convolution to form DCoT, which enlarges the receptive field of the network while obtaining more information. We then replace the last three CBS modules of ELAN with DCoT to form ELAN-D (see Fig. 4), which reduces the computational cost (FLOPs). By combining the strengths of the transformer and the CNN, the DCoT module can capture both global and detailed local information from the input features, which improves the network's ability to represent them.
Fig. 3 shows the architecture of the DCoT block. The input feature X is processed through three pathways, namely Q (queries), K (keys), and V (values), to generate more feature information. The keys undergo dilated convolution to capture local information and increase the receptive field. Then, K is concatenated with X to supplement local information and passed through a CBR module and a standard convolution to generate Q. Finally, Q is multiplied with V and fused with K to obtain the final output Y. The Q, K, and V can be written as: where X is the input feature and $W_{\square}$ denotes the different convolutional blocks.
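To make the receptive-field argument concrete, the following sketch computes the effective kernel size of a dilated convolution and the padding that keeps the feature-map size unchanged. The 3 × 3 kernel and the dilation rates are assumptions for illustration; the text does not state which values DCoT uses, and the helper names are hypothetical.

```python
def effective_kernel(k, dilation):
    """Span covered by a k-tap kernel whose taps are `dilation` pixels apart."""
    return k + (k - 1) * (dilation - 1)


def same_padding(k, dilation):
    """Padding that preserves the spatial size for odd k and stride 1."""
    return dilation * (k - 1) // 2


def conv_output_size(n, k, dilation=1, stride=1, padding=0):
    """Output length of a 1-D convolution over an input of length n."""
    return (n + 2 * padding - effective_kernel(k, dilation)) // stride + 1


if __name__ == "__main__":
    for d in (1, 2, 3):  # assumed dilation rates, for illustration only
        k_eff = effective_kernel(3, d)
        out = conv_output_size(64, 3, dilation=d, padding=same_padding(3, d))
        print(f"dilation={d}: effective kernel {k_eff}x{k_eff}, output size {out}")
```

A 3 × 3 kernel with dilation 2 covers a 5 × 5 window with the same nine weights, which is the sense in which dilation enlarges the receptive field without adding parameters.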

C. Depth-wise Separable Deconvolution for Up-sampling
During object detection with deep learning, the resolution of the feature map tends to decrease as the network deepens, leading to a loss of information. Thus, up-sampling is essential. In the YOLO algorithms, nearest-neighbor interpolation is employed for up-sampling. However, focusing solely on the nearest pixels loses image quality and detail, especially for tiny targets. Deconvolution is another commonly used up-sampling method. Compared with nearest-neighbor interpolation, it preserves feature information better, but it also introduces more parameters. Depth-wise separable convolution [33] decomposes a traditional convolution into a depth-wise convolution and a point-wise convolution, which makes the model more efficient and reduces the number of parameters.
In this paper, we propose the DS-DeConv block for up-sampling. With this method, more diverse pixel values can be produced when recovering the feature map's resolution, so that the resulting feature map preserves more details and features of the original feature map.
We also introduce group convolution and change the filter size of the deconvolution to reduce the number of parameters introduced by deconvolution. Our DS-DeConv method improves network accuracy in up-sampling with only a slight increase in parameters. Fig. 5 illustrates the schematic diagram of DS-DeConv; the number of deconvolution groups is adjusted according to the number of channels in the network.
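The parameter saving from splitting a deconvolution into a depth-wise and a point-wise step can be estimated with simple counting. The sketch below is a rough illustration rather than the DS-DeConv block itself: the channel counts (256 to 128), the 2 × 2 kernel, and the omission of bias terms are all assumptions, and the exact layout in Fig. 5 may differ.

```python
def deconv_params(c_in, c_out, k, groups=1):
    """Weight count of a (grouped) transposed convolution, bias omitted."""
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    return c_in * (c_out // groups) * k * k


def ds_deconv_params(c_in, c_out, k):
    """Depth-wise transposed convolution followed by a 1x1 point-wise convolution."""
    depthwise = deconv_params(c_in, c_in, k, groups=c_in)  # one k x k filter per channel
    pointwise = c_in * c_out                               # 1x1 channel mixing
    return depthwise + pointwise


if __name__ == "__main__":
    c_in, c_out, k = 256, 128, 2  # assumed sizes, for illustration only
    print("standard deconvolution:", deconv_params(c_in, c_out, k))    # 131072
    print("depth-wise separable:  ", ds_deconv_params(c_in, c_out, k)) # 33792
```

Nearest-neighbor interpolation has no learnable parameters at all, so any learned up-sampling adds some; the separable form keeps that addition small relative to a standard deconvolution.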

D. Rotated Detection
Currently, bounding boxes in object detection include horizontal bounding boxes (HBBs), rotated bounding boxes, and custom bounding boxes. Objects in remote sensing images appear in random and diverse directions, so rotated bounding boxes are used to detect them more accurately. Rotated detection methods based on parametric regression mainly comprise the five-parameter and the eight-parameter methods. However, in rotation detection the regression targets are periodic, so predictions near the boundary of the period suffer from discontinuity and an abrupt rise in loss. Therefore, we use CSL [22] to solve the boundary discontinuity problem, as depicted in Fig. 6. The CSL is expressed as follows:
$$ \mathrm{CSL}(x) = \begin{cases} g(x), & \theta - r < x < \theta + r \\ 0, & \text{otherwise} \end{cases} $$
where $g(x)$, $r$, and $\theta$ represent the window function, the window radius, and the angle of the current bounding box, respectively. By converting angle prediction from a regression task to a classification task, the boundary discontinuity issue can be effectively resolved with minimal loss of accuracy.
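The way a circular label removes the boundary discontinuity can be seen in a small sketch. It assumes 180 angle bins of one degree each, a Gaussian window function g(x), and a radius of 6; these values and the function name are illustrative assumptions, not settings reported in this paper.

```python
import math


def circular_smooth_label(theta, num_bins=180, radius=6):
    """Encode angle bin `theta` as a smooth label that wraps around the period."""
    label = [0.0] * num_bins
    # Bins strictly inside (theta - r, theta + r) receive the window value.
    for offset in range(-radius + 1, radius):
        x = (theta + offset) % num_bins  # wrap-around makes the label circular
        # Assumed Gaussian window centered on theta with sigma = radius.
        label[x] = math.exp(-(offset ** 2) / (2 * radius ** 2))
    return label


if __name__ == "__main__":
    label = circular_smooth_label(179)
    # Bins 178 and 0 are both one step away from 179 and get the same value.
    print(label[179], label[178], label[0])
```

Because bin 0 is treated as a neighbor of bin 179, a prediction that falls just across the period boundary is penalized only slightly, instead of incurring the large loss that a regression target would produce.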

IV. EXPERIMENTS AND RESULTS ANALYSIS
A. Datasets
1) DOTA Dataset: The DOTA dataset [10] contains 2806 high-resolution aerial images collected from various sensors and platforms and encompasses 15 categories. It is split into training, validation, and test subsets containing 1411, 458, and 937 images, respectively, with 188,282 instances in total. The image size varies from 800 × 800 to 4000 × 4000 pixels.

B. Implementation Details and Evaluation Index
Considering the adverse influence of high and inconsistent image resolutions, we preprocess the original data of both datasets. For the DOTA dataset, we crop the images into 1024 × 1024 patches with an overlap of 200 pixels. This yields 15749 images for training and 5297 images for evaluation, and the final test results are obtained through the official evaluation server. The network is trained with the SGD optimizer. The learning rate (lr) is 0.001, and the momentum and weight decay are 0.937 and 0.0005, respectively. We train for 300 epochs with a batch size of 16 on two GeForce RTX 3090 GPUs. For the HRSC2016 dataset, we resize all images to 768 × 768. The network is again trained with the SGD optimizer. The learning rate is 0.01, and the momentum and weight decay are 0.937 and 0.0005, respectively. We train for 200 epochs with a batch size of 8 on a GeForce RTX 3060 GPU.
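The overlapping crop of the DOTA images can be sketched as a sliding window with a stride of 1024 − 200 = 824 pixels. The handling of the image border (shifting the last window back so that it ends at the image edge) is an assumption made for this example, since the text does not specify it, and the function names are hypothetical.

```python
def tile_starts(length, tile=1024, overlap=200):
    """Start offsets of overlapping windows along one image dimension."""
    if length <= tile:
        return [0]
    stride = tile - overlap
    starts = []
    pos = 0
    while pos + tile < length:
        starts.append(pos)
        pos += stride
    starts.append(length - tile)  # assumed: last window is aligned to the border
    return starts


def tile_windows(width, height, tile=1024, overlap=200):
    """Return (x0, y0, x1, y1) crop boxes covering the whole image."""
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in tile_starts(height, tile, overlap)
        for x in tile_starts(width, tile, overlap)
    ]


if __name__ == "__main__":
    print(tile_starts(4000))               # offsets along a 4000-pixel side
    print(len(tile_windows(4000, 4000)))   # number of crops for a 4000 x 4000 image
```

The overlap ensures that an object cut by one window boundary appears whole in a neighboring window, at the cost of some duplicated pixels.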
We adopt the Average Precision (AP) and the mean AP (mAP @0.5) metric in the comparative experiments to evaluate the multi-class detection accuracy.They can be calculated as follows:

$$ P = \frac{TP}{TP + FP} \tag{5} $$
where $TP$ is the number of correctly classified targets and $FP$ is the number of background regions recognized as targets. The precision $P$ is the proportion of correctly detected targets among all detection results. The mAP is the average of the AP values of all classes. In the ablation experiments, FLOPs and speed are also used to estimate the differences in algorithm capability.
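As a minimal illustration of these definitions, the sketch below computes the precision from counts of true and false positives and averages per-class AP values into the mAP. The counts and AP values are made-up placeholders for the example, not results from this paper, and the per-class AP computation itself is not shown.

```python
def precision(tp, fp):
    """Fraction of detections that are correct: TP / (TP + FP)."""
    total = tp + fp
    return tp / total if total else 0.0


def mean_average_precision(ap_per_class):
    """mAP as the unweighted mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(precision(tp=8, fp=2))
    print(mean_average_precision({"plane": 0.9, "ship": 0.7}))
```

Because the mAP weights every class equally, a rare category with a low AP lowers the score as much as a frequent one.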

C. Ablation Experiments
In this section, we choose YOLOv7 as the baseline model and conduct ablation experiments on the DOTA dataset to verify the effectiveness of the introduced DCoT block, DS-DeConv, and CSL. Since this paper aims to address rotated RSI detection, unnecessary ablation experiments on horizontal detection are not shown. The batch size for training was 16, and the performance metrics were evaluated every 10 epochs during training. Both the baseline and the improved models were trained for a total of 300 epochs. FLOPs, speed, and mAP are used as evaluation indicators in the experiments. Table I shows the results of our improvements, and Table II shows the detailed AP values of each category on the DOTA dataset. The best results are shown in bold.
As seen from Table I, the speed and mAP of the OBB task are generally lower than those of the HBB task, which is attributed to the angle prediction involved in rotated detection. Note that, to ensure the effectiveness of the baseline, its experiments were all performed at 640 × 640 resolution, while the other experiments were conducted at 1024 × 1024 resolution; the baseline speed at 1024 × 1024 resolution is 40.98. Although the speed and mAP decrease, the actual detection results improve (see Fig. 7). In the horizontal task, compared with the original YOLOv7, models ②, ③, ④, and ⑤ show improvements of 1.9%, 3.4%, 2.7%, and 5.5%, respectively. Relative to YOLOv7 with CSL added, models ③, ④, and ⑤ achieve improvements of 0.41%, 1.05%, and 3.25%, respectively. According to Table II, the proposed method improves greatly in the small vehicle, harbor, and ship categories, by 16.2%, 35.9%, and 8%, respectively, compared with the baseline model.
In Fig. 7, three images from the dataset are chosen to compare the detection results. The two rows show the results of the baseline model and of the proposed D2-Net, respectively. There are plenty of small and dense objects in the leftmost images of Fig. 7

D. Comparison with other OBB Methods
In this section, we take YOLOv7 as the baseline and compare the performance of our model with other state-of-the-art methods on the DOTA-v1.0 and HRSC2016 datasets. Among the compared models, RoI Trans [16], RODFormer [17], and CLT-Det [18] adopt a hybrid network using CNNs and transformer blocks, while the others use a pure CNN structure.
1) Results on DOTA-v1.0: As reported in Table III, the comparative experiments on the DOTA dataset cover the OBB and HBB tasks. In the OBB task, we achieve an mAP of 77.96%, which is 1.79% higher than CSL with a CNN structure and 0.51% higher than CLT-Det with a hybrid framework. Moreover, the prediction performance on densely distributed small objects, such as storage tanks and small vehicles, improves markedly, reaching 89.71% and 81.71%, which are 3.02% and 2.05% higher than the second-best results, respectively. In addition, soccer ball fields, large vehicles, and helicopters also perform well, reaching 76.62%, 87.4%, and 82.87%, respectively. In the HBB task, the proposed model is 5.5% higher than the baseline (from 73.70% to 79.20%). The three categories with the highest AP are plane, tennis court, and ship, at 98.3%, 98.1%, and 97.5%, respectively. Overall, these results demonstrate the effectiveness of our model, and Fig. 10 visualizes some detection results of our method on the DOTA dataset.

V. CONCLUSIONS
In this paper, we proposed an effective one-stage model called D2-Net for rotated remote sensing image detection based on the YOLOv7 model. First, we designed the DCoT block, which combines dilated convolution with the contextual transformer block for feature extraction. It fully utilizes the global and local information of objects and enlarges the receptive field, enhancing the ability to detect small and densely distributed objects in RSIs. Then, we designed the DS-DeConv for up-sampling, which mitigates the effects of complex backgrounds and low resolution. It improves the resolution and quality of the up-sampled feature maps, enabling the detector to capture the details and shapes of the targets more effectively. Additionally, the CSL is employed to compute the angle loss and accomplish the prediction of rotated objects in RSIs. Finally, we conducted experiments on the DOTA and HRSC2016 datasets to verify the effectiveness of D2-Net. Although its detection capability surpasses that of other commonly employed algorithms, the speed and FLOPs have decreased. Thus, in future work we will further enhance the feature representation and improve the detection speed with a more lightweight model.

Fig. 1. The overall structure of our network. SPPCSPC, MP, and RepConv are modules of the original YOLOv7. The detailed composition of each block in Fig. 1 is illustrated in Fig. 2 and Fig. 3.

Fig. 3. The detailed structure of the DCoT block and its module. H, W, and C denote the height, width, and number of channels of the input data X; ⊛ denotes local matrix multiplication.

Fig. 4. The architectures of the ELAN block and ELAN-D block. (a) shows the detailed ELAN structure, and (b) shows the detailed ELAN-D block.
2) HRSC2016 Dataset: The HRSC2016 dataset [23] includes 1061 remote sensing images from six distinct ports. The dataset is divided into three parts: 436 images for training (1207 labeled examples in total), 444 images for testing (1228 labeled examples in total), and 181 images for validation (541 labeled examples in total). The image resolution ranges from 300 × 300 to 1500 × 900 pixels.

Fig. 7. Comparison of detection results. (a) is the result of the baseline; (b) is the result of D2-Net. The differences are highlighted in red.
(a) and Fig. 7(b). The red highlights show that the baseline model misses some targets, while the proposed model detects them effectively. In the middle image, the background is similar to the objects, which degrades the baseline's results, while the proposed model works well. The right image contains many targets with large aspect ratios, and D2-Net localizes them more accurately than the baseline without missing any targets. In Fig. 8(a) and Fig. 8(b), to demonstrate the feature extraction capability of the DCoT modules, we visualize the first 32 feature maps at the same stage of both the baseline and D2-Net. It can be observed that the proposed model effectively suppresses irrelevant background information and extracts target features well. The DCoT modules enable the network to fully utilize feature information and concentrate on targets with distinguishable features. Fig. 9(a) and Fig. 9(b) show the first up-sampling heatmaps of the baseline and D2-Net. The latter preserves more useful feature information around objects and eliminates unnecessary noise, which demonstrates the effectiveness of DS-DeConv in detecting small targets.

Fig. 11. The visualization of the detection results of our method on the HRSC2016 dataset.

TABLE I. THE RESULT OF THE ABLATION EXPERIMENT

TABLE III. OBB TASK PERFORMANCE COMPARISONS ON THE DOTA-V1.0 TEST SET (AP (%) FOR EACH CATEGORY AND OVERALL MAP@0.5 (%)). IN EACH COLUMN, BOLD DENOTES THE BEST DETECTION RESULT

TABLE IV. PERFORMANCE COMPARISONS ON THE HRSC2016 OBB TASK. THE BEST RESULT IS HIGHLIGHTED IN BOLD