Attention-based Cross-Modality Multiscale Fusion for Multispectral Pedestrian Detection

Abstract: Multispectral pedestrian detection has wide applications in fields such as autonomous driving and intelligent surveillance. Mining complementary information between modalities is one of the most effective approaches to improve the performance of multispectral pedestrian detection. However, the inevitable introduction of redundant information between modalities during the fusion process leads to feature degradation. To address this challenge, we propose a multiscale differential fusion algorithm that leverages complementary information between modalities to suppress feature degradation caused by noise propagation along the network. We compare our algorithm with other cross-modal fusion pedestrian detection algorithms on the LLVIP and cleaned KAIST datasets. Experimental results demonstrate that our algorithm outperforms the others, particularly in nighttime scenes, where it achieves a 7.28% improvement in recall over the baseline on the cleaned KAIST dataset.


I. INTRODUCTION
Pedestrian detection plays an important role in autonomous driving systems. In well-illuminated conditions, pedestrian detection achieves high precision. In poor lighting conditions, the appearance of pedestrians becomes blurred. Obstacles, overlapping figures, and varying distances further contribute to these differences in appearance. As a result, nighttime pedestrian detection currently faces significant challenges [1].
Many advanced algorithms based on visible light images achieve notable performance improvements. Recent studies involving these images have validated their effectiveness, including in nighttime environments [2]. However, due to the poor quality of nighttime visible light images, deep convolutional neural networks struggle to learn effective features. Image enhancement techniques show remarkable performance in enhancing the contrast between the foreground and background of an image, and some studies use enhanced images for feature extraction [3]. Nevertheless, the majority of machine vision and deep learning models tend to perform poorly in highly challenging low-light scenarios [4]. Infrared images highlight the thermal radiation characteristics of target objects, allowing details such as human contours to be captured. Therefore, infrared imaging has unique advantages in scenarios with insufficient lighting, adverse weather conditions, or concealed surveillance.
Despite the significant advantages of multimodal input data, effectively fusing information between modalities has become the core challenge and focus of algorithmic research.
Li et al. [5] compared six fusion architectures that integrate color and thermal modalities at different positions. Based on the fusion stage, these architectures can be classified into early fusion, halfway fusion, and late fusion. Late fusion is currently the more commonly employed method and is capable of mitigating the influence of modality and feature misalignment. However, it encounters challenges in network convergence and high computational complexity. We observed that existing discussions rarely address both the redundancy and the complementarity of modalities. Crucially, the spread of redundant information can have detrimental effects in networks. This paper focuses on examining and mitigating these negative impacts by leveraging the differential information of modalities in the backbone network to reduce redundancy.
The Non-Local neural network (Non-Local) [6] enhances inputs by calculating similarity in the channel direction. We conjecture that constructing an attention map by calculating the similarity between pedestrian features could effectively allocate increased attention to pedestrians with blurred characteristics. In multispectral scenarios, there is also a certain level of correlation both between channel dimensions and between spatial dimensions. Therefore, this paper proposes a dual-branch attention mechanism, named Dual Non-Local, which is based on both channel and spatial information. It establishes long-range dependencies across channels and spatial positions. In addition, we use the bright channel prior (BCP) algorithm to address low-light image compensation, and employ a multiscale feature fusion module to integrate the visible and infrared modalities. Our work achieves superior results compared with several existing methods on the publicly available KAIST and LLVIP datasets.
We summarize the contributions of our work as follows: 1) A novel fusion approach is proposed for mining complementarity and reducing feature degradation. This technique involves the cross-fusion of complementary information from different modalities within the backbone network. The outputs of the backbone network for each modality are effectively integrated through this method. 2) A dual-branch attention mechanism based on channel and spatial attention is proposed. We embed positional information into the attention map and reduce the computational complexity of spatial attention. Ultimately, we build a dual-branch 3D attention mechanism in which the spatial and channel dimensions collaborate.

II. RELATED WORK
A. Pedestrian Detection
Pedestrian detection has high practical value in various applications, e.g., autonomous driving and video surveillance, and it receives extensive research attention in the field of computer vision. Pedestrian detection has undergone a significant transformation from handcrafted features to feature extraction with deep convolutional networks [7]. Approaches that rely on handcrafted features fall into two groups: those based on channel features and those based on Deformable Part Models (DPM). In 2009, P. Dollar et al. proposed Integral Channel Features (ICF) [8], which use integral images for rapid feature computation. By combining channel feature pyramids with a cascaded classifier, they achieved faster detection. ICF is the basis of channel features, and Filtered Channel Features (FCF) [9] is an optimization derived from ICF. Conventional algorithms depend on manual design and frequently yield low detection accuracy. With Convolutional Neural Networks (CNNs) demonstrating outstanding feature extraction capabilities across various object detection tasks, recent pedestrian detection methods have focused on leveraging deep learning techniques to enhance detection performance. The emergence of single-stage and two-stage algorithms, such as Faster R-CNN [10], [11], has shown substantial potential for advancing accuracy and speed in pedestrian detection. However, in all-weather conditions, especially nighttime scenes, visible-light-based detection methods struggle to be effective. Infrared images complement visible light images, enabling the capture of pedestrian contours even in nighttime conditions. Detecting pedestrians in all weather conditions using multispectral images of color-thermal pairs has therefore become a research hotspot.

B. Multispectral Pedestrian Detection
Effectively integrating infrared and visible light modalities is a challenging problem. In 2015, Hwang et al. [12] collected the multispectral KAIST dataset. The authors proposed the multispectral Aggregated Channel Features (ACF) method, incorporating intensity and gradient information from the thermal channel as additional channel information. An increasing number of multispectral pedestrian detection algorithms have emerged based on this dataset. Liu et al. [13] confirmed that multimodal pedestrian detection outperforms single-modal detection. They also investigated four fusion architectures: early fusion, halfway fusion, late fusion, and confidence fusion, and concluded that halfway fusion is the most effective. Inspired by Faster R-CNN, Konig et al. [14] proposed an effective multispectral RPN (Region Proposal Network) + BDT (Boosted Decision Trees) model. In addition to investigating the fusion stages of multispectral images, another research direction uses an illumination-aware network to weight the two modalities. Illumination-aware Faster R-CNN (IAF R-CNN) [5] introduced an illumination-weighting mechanism, forming a unified detection framework with separate subnetworks for visible light and infrared, along with a weighting layer. In low-light conditions, the network emphasizes the features learned by the infrared subnetwork, while in well-illuminated conditions it focuses on the visible light subnetwork. Our work is closely related to the conclusions drawn in [14]. We employ YOLOv7 [15] as our baseline and investigate the positive impact of low-light image enhancement techniques on the performance of multispectral pedestrian detection.

C. Attention Mechanism
In deep learning, the attention mechanism emulates the human visual and cognitive system, enabling neural networks to focus on relevant parts of the input. Owing to its outstanding performance, the attention mechanism is widely used in machine vision. Squeeze-and-Excitation Networks (SENet) [16] achieve adaptive channel-wise feature recalibration by modeling interdependencies between channels. The Convolutional Block Attention Module (CBAM) [17] combines a channel attention module with a spatial attention module that operate sequentially, enabling the network to learn both dependencies between channels and positional information. The Non-Local neural network [6] combines self-attention with the general non-local means method, establishing a long-range dependency model for transmitting long-range information. Non-Local keeps the feature scale consistent between input and output, so it can be employed without modifying the network architecture. The Criss-Cross Attention Network (CCNet) [18] and the Global Context Network (GCNet) [19] are improvements derived from Non-Local. The Simple, Parameter-Free Attention Module (SimAM) [20] suggests that attention mechanisms in the human brain often work in synergy, so a unified attention mechanism is more in line with the working mechanism of neurons in the human brain. This paper introduces a new attention mechanism that combines the ideas of SimAM and Non-Local.

III. PROPOSED METHOD
The structure of the proposed model is depicted in Fig. 1. The model uses YOLOv7 as the baseline and integrates it with an illumination compensation module, a multiscale fusion module, and a detection module, forming a unified detection architecture. It consolidates image enhancement and differential fusion into a cohesive framework, addressing both the redundancy and the complementarity across modalities.

A. Illumination Compensation Network
The atmospheric scattering model is commonly employed to represent the degradation process of hazy images and is sometimes used for image enhancement tasks in low-light conditions as well [21]. The original image captured by the camera can be expressed as Eq. (1), where I is the original image, J is the restoration function of the image, A is the environmental light description function, and t is the medium propagation description function.
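For reference, the atmospheric scattering model is commonly written in the following form, using the symbols defined above. The pixel index x is notation assumed here for illustration, and the exact typesetting of Eq. (1) may differ.

```latex
I(x) = J(x)\,t(x) + A\,\bigl(1 - t(x)\bigr)
```

In this form, t(x) close to 1 leaves the scene content J(x) nearly unchanged, whereas t(x) close to 0 replaces it with the environmental light A.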
Wang et al. [22] demonstrated that well-exposed images have at least some pixels with high illumination, unless these pixels are in shadow or covered by a black object. We visualize the bright channel on the KAIST dataset in Fig. 2. Most visible light images in the KAIST dataset show underexposed scenes. We adopt an unsupervised low-light image enhancement network that outputs an adjusted image under the guidance of an unsupervised loss L_BCP. The parameters in Eq. (1) are reinterpreted in Eq. (2), where A and t_p are the environment light and the illumination map, respectively, J_p represents the enhanced output image, and I_p is the observed image. According to BCP [22], we adopt Eq. (3) as the brightest intensity, where J_p^bright represents the brightest intensity over the R, G, B channels, q ranges over the pixels of the patch centered at p, and c indexes the channels of the RGB image. Under the prior, the brightest intensity satisfies J_p^bright → 1. Assuming that A is known, t represents the illumination map, which is considered constant within a patch.
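The bright channel described above (the maximum over the color channels and over a local patch) can be sketched in NumPy as follows. The function name `bright_channel`, the 3×3 default patch size, and the edge padding are assumptions made for illustration; the text does not specify them.

```python
import numpy as np

def bright_channel(image, patch=3):
    """Bright channel of an H x W x 3 image with values in [0, 1].

    For every pixel p, returns the maximum intensity over the RGB
    channels and over the patch x patch window centered at p.
    Patch size and edge padding are illustrative assumptions.
    """
    per_pixel_max = image.max(axis=2)              # max over channels c
    pad = patch // 2
    padded = np.pad(per_pixel_max, pad, mode="edge")
    h, w = per_pixel_max.shape
    out = per_pixel_max.copy()
    # Max over the window, computed by comparing shifted copies.
    for dy in range(patch):
        for dx in range(patch):
            out = np.maximum(out, padded[dy:dy + h, dx:dx + w])
    return out
```

A single bright pixel spreads to its whole neighborhood in the bright channel, which is why underexposed images stand out as having uniformly low values in visualizations such as Fig. 2.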
By taking the maximum operator on both sides of Eq. (3), we obtain an initial illumination map as shown in Eq. (4), where t_p is the illumination value at pixel p, I_q^c is the observed image, q ranges over the pixels of the patch centered at p, and c indexes the channels of the RGB image. Under the supervision of the initial illumination map, we obtain the enhanced illumination map t_p through the Illumination Compensation Network. Substituting t_p into Eq. (2), we obtain the enhanced output image as in Eq. (5). The darkest pixels in the bright channel of the image can be considered as the environment light. To account for dark spots and black objects in real scenes, we take the average value of the darkest 0.1% of pixels (denoted as K) in the bright channel of the image as the environment light, as shown in Eq. (6). To address oversaturation, this paper similarly uses the output of Eq. (7) as the attention map, where T is the thermal image and γ (γ > 1) controls the curvature of the attention map.
The enhancement network alters the feature scale and uses the attention map to optimize spatial weights. In summary, our final illumination compensation network is illustrated in Fig. 3. The visible light features are compensated by the infrared images, effectively enhancing the distinguishability of the RGB images.
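The chain from environment light to enhanced image can be sketched numerically. The sketch below is derived by substituting J_p^bright → 1 into the scattering model I = J t + A (1 − t), which gives t = (I^bright − A) / (1 − A) and J = (I − A) / t + A; the exact forms of Eqs. (4) to (6) may be written differently. The function names and the clipping constant `eps` are assumptions, and the initial map is used directly here, whereas the method refines it with a network.

```python
import numpy as np

def environment_light(bright, fraction=0.001):
    """Mean of the darkest `fraction` of bright-channel pixels (the set K)."""
    flat = np.sort(bright.ravel())
    k = max(1, int(round(fraction * flat.size)))   # at least one pixel
    return float(flat[:k].mean())

def initial_illumination(bright, A, eps=1e-3):
    """Initial illumination map from the bright channel, assuming J^bright -> 1."""
    t = (bright - A) / max(1.0 - A, eps)
    return np.clip(t, eps, 1.0)                    # eps avoids division by zero later

def restore(image, t, A):
    """Invert I = J * t + A * (1 - t) for J, then clip to the valid range."""
    J = (image - A) / t[..., None] + A
    return np.clip(J, 0.0, 1.0)
```

For example, with a bright channel of 0.3, an environment light of 0.05, and an observed intensity of 0.2, the illumination is about 0.26 and the restored intensity is about 0.62, so dark regions are brightened while the scattering model still holds.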

B. Multiscale Fusion Module
Visible light images provide more information about pedestrians in well-illuminated conditions, while thermal images provide more information in low-light conditions. Most multispectral approaches extract features from two streams and directly combine them either by element-wise addition or by channel concatenation. However, these methods overlook the complementarity between the two modalities.
The propagation of redundant information between modalities through the network also has adverse effects.
Inspired by differential modality information [23], we propose a fusion module, the Differential Fusion Module (DFM), to strengthen mutual suppression and enhancement between modalities, as shown in Fig. 4. The features obtained by element-wise subtraction of the two modalities reflect their complementary information and exclude redundant information from feature fusion. This element-wise subtraction also prevents features learned from the other modality in a previous fusion from interfering with the next fusion. Integrated within the architecture of YOLOv7, we perform multiscale feature fusion at the positions illustrated in Fig. 1. In the multispectral pedestrian detection task, it is crucial to effectively integrate valuable information between modalities and to mitigate interference caused by redundant information.
We applied DFM at the Conv3, Conv4, and Conv5 layers of the Backbone. The outputs feed into the multiscale feature fusion network. To further enhance crucial features, we employ the Dual Non-Local attention mechanism before the feature pyramid network, which helps to improve the expression of pedestrian features. DFM uses a differencing mechanism in which F_R and F_T are subtracted element-wise to obtain the feature F_D, that is, F_D = F_R - F_T, where F_R and F_T represent the features extracted from the visible light image and the infrared image, respectively.
Subsequently, F_D is passed through a global average pooling (GAP) layer and a tanh activation layer to obtain the attention map in the channel direction. That attention map is multiplied with the input visible light features and infrared features separately, producing D_R and D_T. Cross addition is then applied to F_T and D_R, as well as to F_R and D_T. Adding the differential features in this way yields the output features.
GAP computes the mean of the two-dimensional feature map within each channel, obtaining an attention map in the channel direction that contains global information. D_R is present in the visible light features but absent in the infrared image features, while D_T is present in the infrared image features but absent in the visible light features. The differential features are formulated as the cross additions described above: the output of F_T is F_T + D_R, and the output of F_R is F_R + D_T. After cross-complementary feature fusion at three scales in the Backbone, the deep semantic information is concatenated. The deep semantic information is then fed into the Dual Non-Local module to enhance crucial information before the feature pyramid network. There is a high degree of correlation between pedestrian features, so establishing long-range dependencies is beneficial for modeling the similarity relationships between pedestrian features.
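The DFM steps described above (element-wise difference, GAP, tanh, channel-wise weighting, cross addition) can be sketched in NumPy for a single scale. This is a minimal illustration under stated assumptions: features are C×H×W arrays, one attention map derived from F_R - F_T is shared by both branches as the text describes, and the function name `differential_fusion` is hypothetical. Learned layers of the actual module, if any, are not modeled.

```python
import numpy as np

def differential_fusion(f_r, f_t):
    """Sketch of differential fusion for C x H x W feature arrays.

    f_r: visible light features, f_t: thermal features.
    Returns the fused outputs for the visible and thermal branches.
    """
    f_d = f_r - f_t                                # differential feature F_D
    attn = np.tanh(f_d.mean(axis=(1, 2)))          # GAP then tanh, shape (C,)
    attn = attn[:, None, None]                     # broadcast over H and W
    d_r = attn * f_r                               # weighted visible features D_R
    d_t = attn * f_t                               # weighted thermal features D_T
    out_r = f_r + d_t                              # cross addition: F_R with D_T
    out_t = f_t + d_r                              # cross addition: F_T with D_R
    return out_r, out_t
```

When both modalities carry identical features, F_D is zero, the attention vanishes, and each branch passes through unchanged, which matches the stated aim of excluding redundant information from the fusion.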

C. Dual Non-Local Attention Mechanism
Capturing long-range dependencies is crucial in pedestrian detection tasks. In deep neural networks, long-range dependencies in an image can only be formed through successive convolutional layers, which gradually build a large receptive field. Inspired by SENet and Non-Local, we propose a 3D attention mechanism called Dual Non-Local, as illustrated in Fig. 5. Does long-range dependency work effectively in multi-pedestrian scenes? Undoubtedly, there is some correlation among the extracted features of pedestrians. The feature similarity matrices between pixels and the feature similarity matrices between channels help to transfer the weights of clear pedestrian features to blurred pedestrian features. Spatial attention and channel attention are applied concurrently to enhance pedestrian features. The Dual Non-Local network consists of a Spatial Attention Module (SAM) and a Channel Attention Module (CAM), which share the same input, denoted as x ∈ R^{C×H×W}. As observed in [19], the attention maps computed by the Non-Local network for different positions are nearly identical; thus, we use a global attention map to reduce computational cost.
In the spatial attention module, we reshape the input into m, m ∈ R^{1×C×(H*W)}. A 1×1 convolution is applied to x to obtain the global information of channels, denoted as n. In the resulting formulation, N is the number of pixels and y_s is shared globally as a spatial attention map.
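A globally shared spatial attention map of this kind can be sketched as below. The sketch follows the global-context simplification of Non-Local attention: a single attention map over the N = H×W positions is computed once and used to pool the reshaped features. The softmax normalization, the weight vector `w_k` standing in for the 1×1 convolution, and the function name are assumptions; the exact formulation of y_s is not reproduced here.

```python
import numpy as np

def global_spatial_attention(x, w_k):
    """Sketch of a globally shared spatial attention map.

    x: input features, shape (C, H, W).
    w_k: weights of a 1x1 convolution with one output channel, shape (C,).
    Returns the attention map (H, W) and the pooled context y_s (C,).
    """
    c, h, w = x.shape
    m = x.reshape(c, h * w)                        # reshaped input, C x N
    n = w_k @ m                                    # 1x1 conv -> one score per position
    n = n - n.max()                                # stabilize the softmax
    attn = np.exp(n) / np.exp(n).sum()             # one map over N positions
    y_s = m @ attn                                 # attention-weighted pooling
    return attn.reshape(h, w), y_s
```

The saving comes from storing one map with N entries instead of the N×N pairwise similarities of the original Non-Local block.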
In the channel attention module, this paper performs pooling operations separately in the x and y dimensions, injecting positional information into the channel attention map. After pooling in the x dimension, the features are denoted as F_x, F_x ∈ R^{C×H×1}, and after pooling in the y dimension they are denoted as F_y, F_y ∈ R , correspondingly. F_x passes through a 1 × 1 filter, followed by a reshape function, to obtain θ_x and φ_x, where θ_x ∈ R^{C×1×H} and φ_x ∈ R^{C×H×1}. A dot product is carried out between θ_x and φ_x to obtain the weights between channels in dimension y, referred to as W_x, W_x ∈ R^{C×C}. In a similar way, W_y, W_y ∈ R^{C×C}, represents the weights between channels in dimension x.
We suggest that the x and y dimensions play equally important roles in channel attention, so the total channel weight distribution is taken to be the sum of the weight distributions in dimension x and dimension y, denoted as W, W = W_x + W_y. Through a 1 × 1 filter and a reshape operation, x generates g_x, g_x ∈ R^{C×(H*W)}, and the channel attention output y_c is obtained by applying the learned channel attention weights W to g_x. To recover the features to their original input dimensions, we use a 1 × 1 filter to generate weights W_z, which are applied to y_c to give the final channel attention output. Finally, there is a shortcut between input and output as a residual structure. The network retains the input x and only learns the difference between output and input, and the data flows across layers to avoid vanishing gradients during training. We denote the final response of x as z and summarize the formulation over every pixel, where i represents a position in the image and j enumerates all possible positions. N_C is the number of channels, and f(C_{W_i}, C_{W_j}) and f(C_{H_i}, C_{H_j}) are the similarities between channels calculated by dot product.
The total loss is defined in Eq. (16), where N is the number of pixels, Φ(p) is the set of pixels within a 3×3 patch centered at p, w_ij represents the affinity matrix within Φ(p), and λ controls the balance between the data term and the smoothing term. Integrating the detection and enhancement losses during training is beneficial for obtaining images with prominent pedestrian features for pedestrian detection.
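The channel branch can be sketched with plain matrix products, treating each 1×1 filter as a C×C channel-mixing matrix. Everything named here is an assumption for illustration: the function name, sharing `w_theta` and `w_phi` between the two pooling directions, average pooling, and the absence of any normalization of W, none of which the text specifies.

```python
import numpy as np

def channel_attention(x, w_theta, w_phi, w_g, w_z):
    """Sketch of directional channel attention with a residual shortcut.

    x: input features, shape (C, H, W).
    w_theta, w_phi, w_g, w_z: C x C matrices standing in for 1x1 filters.
    """
    c, h, w = x.shape
    f_x = x.mean(axis=2)                           # pool along width  -> (C, H)
    f_y = x.mean(axis=1)                           # pool along height -> (C, W)
    # Channel-to-channel similarities from each pooled view, both (C, C).
    w_x = (w_theta @ f_x) @ (w_phi @ f_x).T
    w_y = (w_theta @ f_y) @ (w_phi @ f_y).T
    weights = w_x + w_y                            # W = W_x + W_y
    g = w_g @ x.reshape(c, h * w)                  # g_x, shape (C, H*W)
    y_c = weights @ g                              # apply channel weights
    z = (w_z @ y_c).reshape(c, h, w) + x           # restore shape, add shortcut
    return z
```

Because of the shortcut, setting `w_z` to zero returns the input unchanged, so the block can start as an identity mapping and learn only the correction.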

IV. EXPERIMENTS
In order to demonstrate the effectiveness of the model we proposed, we present the detection results on two datasets, cleaned KAIST and LLVIP.
A. Dataset
1) KAIST:
The KAIST dataset, proposed by Hwang et al. [12], consists of multispectral pedestrian data captured by specialized hardware with a beam splitter. It comprises 95,328 pairs of color and thermal images. However, the dataset is derived from consecutive video frames, which causes high similarity between adjacent images, so we perform data cleaning. After cleaning, we obtain 7601 pairs of images as the training set and 2252 pairs of images as the testing set. Additionally, we adopted the re-annotated labels by Li [24] and Hangil [25] for the training and test sets, respectively, to enhance label quality.

2) LLVIP:
The LLVIP [26] dataset consists of image pairs that are rigorously aligned in both time and space, and it is used for pedestrian detection in low-light conditions. The entire dataset comprises 15,488 pairs of color-thermal images.

B. Evaluation Metrics
We use Recall and Average Precision (AP) as the evaluation metrics for the proposed model. Here, TP (True Positive), FP (False Positive), and FN (False Negative) denote true positive predictions, false positive predictions, and missed ground-truth pedestrians, respectively. Recall is the ratio of detected pedestrians to all ground-truth pedestrians: Recall = TP / (TP + FN).
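As a small worked example of the metric, the sketch below counts TP, FP, and FN by greedily matching predicted boxes to ground-truth boxes and then computes Recall. The IoU threshold of 0.5, the greedy matching rule, and all function names are assumptions for illustration, not the evaluation protocol used in the experiments.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def count_matches(preds, gts, thr=0.5):
    """Greedy matching; preds are assumed sorted by descending confidence."""
    matched = set()
    tp = fp = 0
    for p in preds:
        best, best_j = 0.0, -1
        for j, g in enumerate(gts):
            if j in matched:
                continue                           # each ground truth matches once
            v = iou(p, g)
            if v > best:
                best, best_j = v, j
        if best >= thr:
            matched.add(best_j)
            tp += 1
        else:
            fp += 1
    fn = len(gts) - len(matched)                   # ground truths never detected
    return tp, fp, fn

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0
```

With four ground-truth pedestrians and three predictions of which two match, the counts are TP = 2, FP = 1, FN = 2, so Recall is 0.5 while precision is about 0.67; the two metrics differ exactly in whether FN or FP appears in the denominator.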

C. Implementation Details
In this paper, we built our network based on YOLOv7 and added an illumination compensation network at the input, which enhances the visible light image using the bright channel prior. In the Backbone, the differential fusion module was applied to the feature inputs of Conv3, Conv4, and Conv5 to reduce redundancy in modal fusion. Finally, the proposed attention mechanism was added to model long-range dependencies, facilitating direct transmission of high-level semantics.
The experiments were conducted on an NVIDIA GeForce RTX 4080 GPU and an Intel(R) Core(TM) i7-13700F CPU, using the PyTorch framework and the public YOLOv7 code. We set the batch size to 8 and the number of epochs to 100, and resized input images to 640×640. K-means clustering provided nine anchor boxes for the KAIST dataset: [44,65], [26,111] [71,152]. We used training tricks such as mosaic augmentation and random cropping to enhance the network's generalization.
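Anchor clustering of this kind is often done with k-means on box widths and heights, assigning each box to the anchor with the highest IoU. The sketch below illustrates that idea; the IoU-based assignment, the deterministic area-quantile initialization, and the function names are assumptions, and it is not the clustering script used for the reported anchors.

```python
import numpy as np

def wh_iou(boxes, anchors):
    """IoU between (N, 2) boxes and (K, 2) anchors, all aligned at the origin."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0])
             * np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    area_b = boxes[:, 0] * boxes[:, 1]
    area_a = anchors[:, 0] * anchors[:, 1]
    return inter / (area_b[:, None] + area_a[None, :] - inter)

def kmeans_anchors(boxes, k, iters=50):
    """Cluster (width, height) pairs into k anchors, sorted by area."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(boxes.prod(axis=1))
    # Deterministic start: boxes at evenly spaced area quantiles.
    init = order[np.linspace(0, len(boxes) - 1, k).astype(int)]
    anchors = boxes[init].copy()
    for _ in range(iters):
        assign = wh_iou(boxes, anchors).argmax(axis=1)   # nearest anchor by IoU
        new = anchors.copy()
        for j in range(k):
            members = boxes[assign == j]
            if len(members):                             # keep empty clusters as-is
                new[j] = members.mean(axis=0)
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors[np.argsort(anchors.prod(axis=1))]
```

Using IoU rather than Euclidean distance keeps large boxes from dominating the clustering, which matters for pedestrians because their boxes are tall and narrow across a wide range of scales.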

D. Results Analysis
We compare our algorithm with Halfway Fusion and IAF R-CNN on the cleaned KAIST dataset. Here, we primarily discuss the potential advantages of our method, such as how our framework uses a Halfway Fusion architecture for integration, and we identify the key components that enhance detection performance. The pedestrian detection results are presented in Table I. On the cleaned KAIST dataset, the proposed method achieves the best detection performance, with a Recall of 64.17%. Compared to IAF R-CNN, our method is equally competitive, maintaining a high recall rate while requiring an inference time of only 0.096 s/image, as opposed to 0.210 s/image for IAF R-CNN. The inference times, measured on an NVIDIA GeForce RTX 4080 GPU, are recorded in Table II. The speed advantage is attributed to the real-time nature of the single-stage object detection algorithm, while the improvement in recall rate is due to our differential fusion module, which effectively mines the complementary features of pedestrians and reduces redundant noise interference in feature propagation. However, the effectiveness of DFM is limited by the requirement that the input pairs of visible and infrared images be strictly aligned. Misaligned image pairs transmit incorrect differential information, and the noise is amplified by the network. Addressing this limitation adequately calls for more sophisticated image acquisition instruments.
In Tables III, IV, and V, we compare our algorithm with YOLOv7. We explored three input versions of YOLOv7: a) the RGB branch; b) the thermal branch; c) the concatenation of the thermal image and the visible light image as input. Directly concatenating the visible light image and the thermal image did not lead to a significant improvement. We achieve improvements of 3.19%, 2.12%, and 7.28% in Recall on the all-weather, daytime, and nighttime test sets, respectively. Our method demonstrates better performance in both accuracy and Recall, indicating the effectiveness of our fusion strategy. The illumination compensation network enhances pedestrian features in low-light scenarios; thus, we obtain the highest performance in nighttime scenes.
Fig. 6 shows visualizations of enhanced images output by the Illumination Compensation Network. We observed that the obstacles in the blue boxes have a high similarity to pedestrians, especially in low-light scenarios. The illumination compensation network is advantageous in suppressing background features and enhancing foreground characteristics, which enables the network to concentrate more on pedestrian targets, free from background interference. In the third row of Fig. 6b, the pedestrian features in the blue box are clearer. However, the multispectral color-thermal image pairs must be aligned. When misalignments occur, our model produces worse results; avoiding them requires more sophisticated image acquisition instruments.
In addition to the quantitative analysis, we also provide several qualitative results on the cleaned KAIST dataset in Fig. 7. Compared to the baseline model, our method generates more precise bounding boxes and detects pedestrians more accurately, especially in challenging scenarios.
Gradient-weighted Class Activation Mapping (Grad-CAM) is a method for visualizing the attention of deep neural networks. Our Dual Non-Local module constructs a unified attention framework based on the similarity of channel and spatial features, making it particularly suitable for single-object detection scenarios. The feature similarity between different types of targets may cause confusion in the attention map. In single-object detection tasks, our Dual Non-Local module demonstrates superior performance compared to Non-Local and SimAM. These three similar attention mechanisms have consistent input and output dimensions. We removed all other modules in Fig. 1, retaining only the attention module. We then placed different attention mechanisms at this position and trained using the RGB images from LLVIP.
Using Grad-CAM, we visualized the outputs of Dual Non-Local, Non-Local, and SimAM, as shown in Fig. 8. Our Dual Non-Local model focuses more attention on the entirety of pedestrians, while Non-Local and SimAM distribute attention more precisely but both leave some pedestrian regions without attention. Comparatively, although the attention regions of Dual Non-Local are less precise, all pedestrian regions receive strong attention. This also validates that the features of clear pedestrians can rectify those of pedestrians with blurred features.
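The core Grad-CAM computation behind such heatmaps can be sketched as follows: the gradients of the target score with respect to a layer's feature maps are averaged spatially to give one weight per channel, and the weighted sum of the feature maps is passed through a ReLU. The sketch assumes the activations and gradients have already been extracted from the network as arrays; the function name and the final max-normalization are illustrative choices.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from K x H x W activations and matching gradients.

    gradients holds d(score) / d(activations) for the chosen target.
    Returns an H x W map scaled to [0, 1] when any response is positive.
    """
    weights = gradients.mean(axis=(1, 2))                # one weight per channel
    cam = np.tensordot(weights, activations, axes=1)     # weighted sum -> (H, W)
    cam = np.maximum(cam, 0.0)                           # keep positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam
```

The resulting map is then upsampled to the input resolution and overlaid on the image, which is how regions receiving attention can be compared across attention modules.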

E. Ablation Experiment
Our model achieves leading performance. Nevertheless, the specific contribution of each module to the results remains uncertain. To address this, we design ablation experiments to quantify these contributions. The comparison results are presented in Table VI. [...] attention mechanism. Thanks to Eq. (15), the proposed L_BCP loss also contributes to the performance improvement by distinguishing foreground from background as effectively as possible.

2) Multiscale Fusion Module (MFM):
As shown, the one-branch methods are clearly inferior to the two-branch approach. However, the crucial factor is the fusion stage, and the halfway fusion structure achieves the best performance. For the two-branch method, we created two separate backbones to process visible and infrared images. It is worth noting that we omitted the illumination compensation network during this experiment, so both modalities were fed directly into their respective backbones. According to the results, DFM plays a crucial role in improving detection performance, which agrees with our initial conjecture. Although DFM requires strictly aligned image pairs as input, this outcome provides strong experimental support for future research on pedestrian detection in more challenging environments.

F. Other Dataset
To demonstrate the generalization capability of our algorithm, we conducted experiments not only on the cleaned KAIST dataset but also on another multispectral pedestrian detection benchmark, LLVIP. Most images in the LLVIP dataset were captured in low-light nighttime conditions. We report the performance on the LLVIP dataset in Table VII, with mAP as the evaluation metric.

V. CONCLUSION AND FUTURE WORK
In this paper, we investigated how to integrate color-thermal image pairs effectively, leveraging the complementarity and exclusivity between modalities to enhance detection performance. We proposed an algorithm based on multiscale feature fusion. Specifically, we performed image enhancement on the input visible light image and simultaneously improved the Backbone network by integrating the two modalities using differential information in the Conv3, Conv4, and Conv5 convolutional layers. Our approach demonstrated outstanding performance on the cleaned KAIST and LLVIP datasets. In nighttime scenarios in particular, we achieved an improvement of 7.28% in Recall compared to the baseline on the cleaned KAIST dataset. We suggest that the proposed Dual Non-Local attention mechanism is also effective for other single-object detection tasks, which is part of our future work. The findings of this paper offer a novel approach to combining image enhancement techniques and feature fusion for multispectral pedestrian detection, with potential applications beyond pedestrian detection. In future work, we aim to further explore the complementarity between modalities and reduce redundant information between modalities in more challenging weather conditions, such as rain and snow.

Fig. 1. The overall structure of the proposed method. The network takes multispectral images of color-thermal pairs as inputs.

Fig. 2. The bright channel visualization of KAIST. We selected several low-light images and calculated the brightest pixels in the R, G, B channels for each image, denoted as the bright channel.

Fig. 3. The structure of the Illumination Compensation module. T_attention is fed to five convolutional layers to adjust the visible light features. t is an initial illumination map.

Fig. 4. The structure of the differential fusion module. F_R and F_T are the visible light features and the thermal features. We obtain F_D by element-wise subtraction.

Fig. 5. The structure of the Dual Non-Local network, which consists of two attention branches. SAM uses a globally shared attention map δ(n′) to reduce computation, while CAM calculates attention maps in different dimensions.

Fig. 6. The visualizations of enhanced features from the Illumination Compensation Network: a) some examples; b) detail comparison.

Fig. 7. The visualizations of the baseline and our algorithm: a) visible light images; b) baseline detection results; c) our detection results; d) ground truth. According to the results, our method generates more correct target boxes and performs better when the visible light images are captured in low-light conditions.

TABLE I. COMPARISON ON CLEANED KAIST DATASET IN TERMS OF RECALL

TABLE II. COMPARISON OF INFERENCE TIME USING AN NVIDIA GEFORCE RTX 4080 GPU

TABLE III. COMPARISON ON CLEANED KAIST DATASET FOR NIGHTTIME SCENES IN TERMS OF AVERAGE PRECISION, RECALL, AND ACCURACY

TABLE IV. COMPARISON ON CLEANED KAIST DATASET FOR DAYTIME SCENES IN TERMS OF AVERAGE PRECISION, RECALL, AND ACCURACY

TABLE V. COMPARISON ON CLEANED KAIST DATASET FOR ALL WEATHER SCENES IN TERMS OF AVERAGE PRECISION, RECALL, AND ACCURACY

TABLE VI. ABLATION RESULTS ON THE CLEANED KAIST DATASET IN TERMS OF PRECISION

TABLE VII. COMPARISON ON LLVIP DATASET FOR ALL WEATHER SCENES IN TERMS OF AVERAGE PRECISION, RECALL, AND ACCURACY