Dual-Branch Grouping Multiscale Residual Embedding U-Net and Cross-Attention Fusion Networks for Hyperspectral Image Classification

Because acquiring labelled hyperspectral samples is costly and time-consuming, classifying hyperspectral images with a small number of training samples remains an urgent problem. In recent years, U-Net has been shown to train high-precision models from a small amount of data, which makes it well suited to small-sample settings. To this end, this paper proposes a dual-branch grouping multiscale residual embedding U-Net and cross-attention fusion network (DGMRU_CAF) for hyperspectral image classification. The network contains two branches, a spatial GMRU and a spectral GMRU, which reduces the interference between spatial and spectral features. Each branch introduces U-Net and a grouped multiscale residual block (GMR): in the spatial GMRU, the block compensates for the loss of spatial feature information during down-sampling, and in the spectral GMRU it addresses redundancy in the spectral dimension. To fuse the spatial and spectral features of the two branches effectively, a spatial-spectral cross-attention fusion (SSCAF) module is designed to enable interactive fusion of spatial-spectral features. Experimental results on the WHU-Hi-HanChuan and Pavia Centre datasets show the superiority of the proposed method.


I. INTRODUCTION
Hyperspectral images (HSI) contain a wealth of spatial and spectral information [1], which can accurately characterize the physical attributes of ground objects, enhance the ability to discriminate between them, and greatly facilitate their recognition. However, hyperspectral image classification (HSIC) has its own particular problems, such as the redundancy of information in spectral bands [2], the scarcity of training samples [3], and class imbalance, all of which make the task challenging.
Traditional HSIC methods, such as linear classifiers [4], support vector machines [5] and random forests [6], can achieve good classification results with improvements, but many of the original methods rely on hand-crafted features and perform poorly when the number of samples is small and the HSI data dimension is high. Therefore, principal component analysis (PCA) has been widely applied to HSIC [7][8][9][10]. Compressing the original data to reduce the spectral dimension avoids information redundancy between bands and the possible Hughes phenomenon, which provides effective preprocessing for subsequent feature extraction and enables the network to obtain higher classification accuracy.
With the development of deep learning, the encoder-decoder network U-Net [11], originally designed for biomedical image segmentation, has gradually been applied to hyperspectral image classification, where it can obtain superior results with less training data. Faced with a lack of data, Lin et al. [12] introduced U-Net to address the difficulty of data capture in practice. Paul et al. [13] combined spectrum partitioning to reduce spectral redundancy and then designed a U-Net architecture with depthwise separable convolution to reduce overfitting. Besides, because U-Net has a clear network structure, customized layers can be easily integrated into the existing network. For example, He et al. [14] embedded the Swin transformer into the classical CNN-based U-Net to acquire global contextual information of remote sensing images and obtain deeper features in the main encoder. Xiao et al. [15] improved the spatial resolution of HSI by fusing spatial features of different scales and depths in the MSI for U-Net.
Moreover, jointly using spectral and spatial information to design classifiers has become a major research direction for improving the classification performance of hyperspectral images. Constructing spatial and spectral information through two branches makes full use of this information. Yang et al. [16] constructed a dual-channel CNN that extracts spectral and spatial information in separate channels and then connects the spatial-spectral features by cascading, but this simple feature connection cannot capture the complex relationship between the spatial and spectral features. Wang et al. [17] used a grouping strategy and the Long Short-Term Memory (LSTM) model to perceive spectral multi-scale information and obtain spatial context features in spectral and spatial sub-networks. Considering the different importance of the spectral and spatial components, they fused them with an adaptive feature combination method. For effective fusion of spatial-spectral information, Sun et al. [18] designed a weighted self-attention fusion strategy, which combines the output weights of each branch of the preceding network with the output weights of self-attention and achieves efficient fusion on a multi-structured network. Yang et al. [19] used a dual-branch fusion mechanism that promotes the exchange of feature information between the two branches through upstream and downstream modules, so that local fine-grained features can be constructed in more detail and global context information can be better utilized. These works provide new ideas for dual-branch feature extraction and fusion in HSIC.
www.ijacsa.thesai.org
In general, U-Net-based methods can better learn representations of the input image, which helps hyperspectral image classification reach high accuracy, but some small-size information is lost during down-sampling. The design of the fusion mechanism under two-branch conditions also affects the effectiveness of the network. In this context, we propose a dual-branch grouping multiscale residual embedded U-Net and cross-attention fusion network. The main contributions are as follows:
• A spatial-spectral cross-attention fusion module (SSCAF) is designed to cross-fuse the spatial and spectral features generated by the two branches, that is, to fuse the parameters of the other branch into each branch, which increases the interaction between the two branches and promotes their full fusion.
The rest of the paper is organized as follows. Section II describes the general framework of the DGMRU_CAF network, the GMR module and the SSCAF module. Section III presents the datasets, the experimental settings, the experimental results and the discussion. Finally, Section IV gives the conclusions.

A. The Overall Framework of DGMRU_CAF
The DGMRU_CAF proposed in this paper is composed of the DGMRU, the SSCAF module and a classification network, as shown in Fig. 1. The DGMRU is divided into a spatial GMRU branch, which takes the HSI neighbourhood block as input, and a spectral GMRU branch, which takes the spectral band as input. Each branch extracts the corresponding features along the combined path of U-Net and GMR with different kernel scales, so as to obtain deeper feature information. The designed GMR enhances the model's perception of multiscale spatial and spectral information through grouping, multiscale convolution and residual connection, and thus retains more detailed feature information. Afterwards, to jointly utilize spatial and spectral information, the SSCAF module is constructed. Guided by its own features, each branch introduces the features of the other branch and carries out interactive fusion to generate spatial-spectral features. Finally, to obtain the classification results of the HSI, the spatial-spectral features are passed through a classification network consisting of a fully connected layer and a softmax activation layer.
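As a rough illustration of the two branch inputs described above, the following sketch extracts a neighbourhood block and a spectral vector for one pixel of a toy data cube. The function name `extract_branch_inputs`, the nested-list layout `cube[row][col][band]`, the 3x3 patch size and the edge-replication padding are all assumptions made for this example; the paper does not specify them.

```python
def extract_branch_inputs(cube, row, col, patch=3):
    """Return (spatial_block, spectrum) for the pixel at (row, col).

    cube is a nested list indexed as cube[r][c][band]; patch is assumed odd.
    """
    height, width = len(cube), len(cube[0])
    half = patch // 2

    def clamp(v, size):
        # Edge replication (an assumption): out-of-range indices snap to the border.
        return max(0, min(v, size - 1))

    # Spatial branch input: a patch x patch neighbourhood with all bands kept.
    spatial_block = [
        [list(cube[clamp(row + dr, height)][clamp(col + dc, width)])
         for dc in range(-half, half + 1)]
        for dr in range(-half, half + 1)
    ]
    # Spectral branch input: the full spectral vector of the centre pixel.
    spectrum = list(cube[row][col])
    return spatial_block, spectrum


# Toy 4x4 cube with 5 bands; each value encodes (row, col, band) for easy inspection.
cube = [[[100 * r + 10 * c + b for b in range(5)] for c in range(4)] for r in range(4)]
block, spectrum = extract_branch_inputs(cube, row=0, col=0, patch=3)
```

In this toy case the pixel lies in a corner, so part of the block is filled by replicating border pixels; a real pipeline might instead use zero or mirror padding.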

B. The GMR Module
In this paper, a GMR module is proposed to retain more features without increasing the number of parameters. A spatial GMR and a spectral GMR are designed for the respective branches.
1) Spatial GMR: As shown in Fig. 2(a), in the spatial GMRU branch, the spatial GMR uses a grouping module to group the channels of the input spatial intermediate features in sequence, so that each group of vectors contains different channel information. Each group of spatial vectors is expressed as
$$X_i = [x_i^1, x_i^2, \dots, x_i^t], \quad i = 1, 2, \dots, g,$$
where $x_i^j$ is the feature information corresponding to the $j$-th channel in the $i$-th segment, $t$ represents the number of channels in each group, and $g$ represents the number of groups.
During down-sampling, the features of small-size objects are easily weakened and lost, and they are difficult to recover by up-sampling, which leads to the misclassification of small-size objects. To solve this problem, this paper applies convolutions of different sizes to the groups for multi-scale feature extraction, capturing the local features inherent in space. The convolution output of each group is
$$Y_i = W_i * X_i + b_i,$$
where $X_i$ is the feature vector of the $i$-th group, $W_i$ is the weight coefficient of the $i$-th group, $b_i$ is the bias coefficient of the $i$-th group, and $*$ denotes convolution. Then, to complement the context information of features at different scales, all groups are merged by cascading. Finally, the rich low-frequency information is transmitted directly through the residual connection, which speeds up the training of the network. In short, the spatial GMR can extract more representative fine features.
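The spatial GMR steps described above (sequential grouping, one kernel size per group, cascading, residual connection) can be sketched on toy data as follows. This is only a simplified illustration: the learned weights and biases are replaced by a fixed mean filter applied channel by channel, zero padding is assumed, kernel sizes are assumed odd, and the names `box_filter` and `spatial_gmr` are hypothetical.

```python
def box_filter(channel, k):
    """Same-size k x k mean filter with zero padding (stand-in for a learned convolution)."""
    h, w = len(channel), len(channel[0])
    half = k // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            total = 0.0
            for dr in range(-half, half + 1):
                for dc in range(-half, half + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        total += channel[rr][cc]
            out[r][c] = total / (k * k)
    return out


def spatial_gmr(channels, kernel_sizes):
    """Group channels in order, filter each group at its own scale, cascade, add residual."""
    g = len(kernel_sizes)                 # number of groups
    if len(channels) % g != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    t = len(channels) // g                # channels per group
    groups = [channels[i * t:(i + 1) * t] for i in range(g)]

    merged = []
    for group, k in zip(groups, kernel_sizes):
        # Cascade (concatenate) the filtered channels of every group.
        merged.extend(box_filter(ch, k) for ch in group)

    # Residual connection: add the unmodified input back to the merged output.
    return [
        [[m + x for m, x in zip(m_row, x_row)] for m_row, x_row in zip(m_ch, x_ch)]
        for m_ch, x_ch in zip(merged, channels)
    ]


# Four 3x3 channels filled with ones, split into two groups with 1x1 and 3x3 kernels.
ones = [[[1.0] * 3 for _ in range(3)] for _ in range(4)]
out = spatial_gmr(ones, kernel_sizes=[1, 3])
```

The output keeps the input shape, so such a block could in principle be dropped into an encoder or decoder stage without changing the surrounding layers.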
2) Spectral GMR: Hyperspectral images contain a large amount of spectral information, but this information is redundant, which easily produces the Hughes phenomenon and affects the classification results. To cope with this problem and effectively capture the locally relevant information of the spectral bands, as shown in Fig. 2(b), the grouping module of the spectral GMR groups the spectral dimensions of the spectral intermediate features in sequence, so that each group of spectral vectors contains different spectral band information. The number of bands contained in each group and the distance between bands depend on the number of groups. Each set of spectral vectors is represented as
$$S_i = [s_i^1, s_i^2, \dots, s_i^m], \quad i = 1, 2, \dots, n,$$
where $s_i^j$ is the feature information of the $j$-th spectral dimension in the $i$-th segment, $m$ represents the number of spectral bands in each group, and $n$ represents the number of groups.
Next, convolutions of different scales are used to extract the grouped spectral features, so as to weaken the correlation between spectral bands and reduce the redundancy of information. The output spectral features of each group are then merged by cascading, which complements the local information of the spectral features at different scales and makes full use of the correlation between the spectral bands. Finally, the original global information is propagated directly by the residual connection, which alleviates the problem of gradient degradation. In conclusion, the spectral GMR can fully extract both the global and local information of the spectra.
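The spectral counterpart can be illustrated in one dimension. The sketch below assumes contiguous band groups, zero padding and fixed hypothetical kernel weights in place of learned ones; `conv1d_same` and `spectral_gmr` are names introduced here for illustration only.

```python
def conv1d_same(signal, kernel):
    """1-D correlation with zero padding so the output keeps the input length."""
    half = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < n:
                acc += w * signal[idx]
        out.append(acc)
    return out


def spectral_gmr(spectrum, kernels):
    """Split bands into contiguous groups, filter each group, cascade, add residual."""
    n_groups = len(kernels)
    if len(spectrum) % n_groups != 0:
        raise ValueError("band count must be divisible by the number of groups")
    bands_per_group = len(spectrum) // n_groups

    merged = []
    for i, kernel in enumerate(kernels):
        group = spectrum[i * bands_per_group:(i + 1) * bands_per_group]
        merged.extend(conv1d_same(group, kernel))    # one scale per group, then cascade

    # Residual connection carries the original spectrum forward unchanged.
    return [m + x for m, x in zip(merged, spectrum)]


spectrum = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
kernels = [[1.0], [0.5, 0.0, 0.5]]    # hypothetical fixed weights: size 1 and size 3
out = spectral_gmr(spectrum, kernels)
```

Because each group is filtered separately, bands in different groups never mix before the cascade, which is one simple way to read the claim that grouping weakens the correlation between spectral bands.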

C. The SSCAF Module
Considering the complementary characteristics of spatial and spectral features, a spatial-spectral cross-attention fusion (SSCAF) module is proposed in this paper to promote the effective fusion of these two types of features. As shown in Fig. 3, the module combines a cross self-attention module, a positional self-attention module (PAM) and a channel self-attention module (CAM). The cross self-attention operation is defined in terms of the following quantities: the spatial and spectral feature vectors generated by the two branches; a function that produces the adaptive weight vector between the two vectors; a function that produces the feature representation of an individual input vector; and a normalization factor defined as a sum over the weights.
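One possible reading of the cross self-attention described above is sketched below. It assumes an exponential dot-product weight function, an identity feature representation and element-wise summation of the two attended results; these choices, and the names `cross_attend` and `cross_fuse`, are assumptions for illustration and are not confirmed by the paper. The PAM and CAM refinements are omitted, and both branches are assumed to provide the same number of feature vectors.

```python
import math


def cross_attend(queries, context):
    """Attend from one branch (queries) to the other branch (context).

    y_i = (1 / C_i) * sum_j f(q_i, c_j) * g(c_j), with the assumed choices
    f(q, c) = exp(q . c), g(c) = c, and C_i = sum_j f(q_i, c_j).
    """
    dim = len(context[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) for c in context]
        peak = max(scores)                               # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]   # f, up to a common factor
        norm = sum(weights)                              # normalization factor C_i
        outputs.append([
            sum(w * c[d] for w, c in zip(weights, context)) / norm
            for d in range(dim)
        ])
    return outputs


def cross_fuse(spatial, spectral):
    """Each branch attends to the other; the two results are summed element-wise."""
    spa_from_spe = cross_attend(spatial, spectral)
    spe_from_spa = cross_attend(spectral, spatial)
    return [[a + b for a, b in zip(u, v)] for u, v in zip(spa_from_spe, spe_from_spa)]


spatial = [[1.0, 0.0], [0.0, 1.0]]
spectral = [[1.0, 0.0], [0.0, 1.0]]
attended = cross_attend(spatial, spectral)
fused = cross_fuse(spatial, spectral)
```

With one-hot context vectors, each attended output is a convex combination of them, so its entries sum to one; this makes the normalization easy to check by hand.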
To further establish internal connections, PAM and CAM modules are introduced to refine the spatial and spectral features. Finally, the feature information is summed and complementarily fused to obtain the final spatial-spectral fusion feature.

A. Dataset and Experimental Setting
In this section, to demonstrate the validity of the proposed method, we conduct a number of experiments on two datasets: WHU-Hi-HanChuan (HC) [20], [21] and Pavia Centre (PC). We divided the labelled samples in different ways for each dataset. Table I and Table II list the sample information for each class in the HC and PC datasets.
In training the model, the number of epochs is set to 200, the batch size to 16, the learning rate to 0.001 and the weight decay to 1e-5, and training is repeated 10 times for all the datasets. To demonstrate the superiority of the proposed method, this paper conducts comparative experiments with six advanced methods, namely 2DCNN [22], SSRN [23], A2S2K [24], ASSMN [17], U-Net [11] and HyperUnet [13]. The overall accuracy (OA), average accuracy (AA), Kappa coefficient and per-class classification accuracy are used as the performance evaluation criteria of the model. The higher each index is, the better the classification effect.
The results on the HC dataset are shown in Table III, with the best OA, AA, and Kappa results highlighted in bold. Fig. 5 shows the classification maps of the different methods.
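For reference, the three summary criteria named above can be computed from a confusion matrix as in the following sketch. The standard definitions are used (OA as the fraction of correctly classified samples, AA as the mean of the per-class accuracies, and Cohen's Kappa); the function name and the toy matrix are illustrative and are not taken from the paper's experiments. Every class is assumed to have at least one sample.

```python
def classification_metrics(confusion):
    """Return (OA, AA, Kappa) from a square confusion matrix.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))

    oa = correct / total                                   # overall accuracy

    per_class = [confusion[i][i] / sum(confusion[i]) for i in range(n_classes)]
    aa = sum(per_class) / n_classes                        # average of per-class accuracies

    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(n_classes)) for j in range(n_classes)]
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)
    kappa = (oa - expected) / (1 - expected)               # Cohen's kappa
    return oa, aa, kappa


# Toy two-class example (not data from the paper).
confusion = [
    [40, 10],
    [5, 45],
]
oa, aa, kappa = classification_metrics(confusion)
```

AA weights every class equally regardless of its size, which is why it is reported alongside OA when classes are imbalanced.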
It can be seen from Table III that our method achieves the best performance, with an OA of 96.22%, an AA of 96.62%, and a Kappa of 95.57%. Compared with the other methods, OA, AA, and Kappa are increased by at least 0.8%, 0.76%, and 0.92%, respectively. This is because the proposed method combines the dual-branch design with U-Net, which improves the feature extraction ability of the convolutions. The grouping multiscale residual block extracts features with different kernel sizes in each group and reduces the loss of feature information, which makes feature extraction more effective. The classification results on HSI demonstrate the validity of the method. In addition, the OA of 2DCNN is the lowest, only 78.91%, because 2DCNN is trained only on the spatial dimension and ignores the information between spectral bands. Compared with 2DCNN, U-Net, with its U-shaped network structure, improves the three evaluation criteria by 13.38%, 15.84% and 15.3%, respectively. HyperUnet, which combines U-Net with the grouping idea, performs poorly on this dataset, possibly because it adapts poorly to large datasets. The evaluation values of SSRN are comparable to those of U-Net. SSRN extracts spatial-spectral features through two consecutive spectral blocks and spatial blocks. However, the input of the spatial block comes from the spectral block, so some spatial information is lost in the spectral block, which lowers the classification accuracy. The OA of A2S2K is 1.69% higher than that of SSRN, which indicates that introducing the attention mechanism and adaptive methods has a significant effect on the network. Compared with the single-branch algorithms, the dual-branch ASSMN achieves a better OA, which indicates that making full use of both spatial and spectral feature information yields superior classification results, much better than using spatial or spectral information alone. Although the proposed method is inferior to other algorithms in some categories, its results in those categories are very close to the best ones, and its OA, AA and Kappa coefficients are the highest among all the methods.
In the classification maps shown in Fig. 5, the map of 2DCNN has the most severe "salt and pepper" noise because 2DCNN does not include spectral information, while the maps of the other networks show stronger classification ability because spectral information is taken into account. The method proposed in this paper considers spatial features of different scales and addresses the redundancy problem, so it captures more small-size objects and feature information. Therefore, for scenes with many small-size objects, the proposed method more readily produces accurate and clean classification maps, and the classification results of the various categories correspond to the results in Table III.
The results on the PC dataset are shown in Table IV, with the best OA, AA, and Kappa results highlighted in bold. Fig. 6 shows the classification maps of the different methods.
As shown in Table IV, all methods, including 2DCNN, achieve decent classification results on the PC dataset. The AA of both 2DCNN and SSRN is lower than 90%, which is due to poor classification accuracy in some categories, where the accuracy falls below 80%. The OA, AA and Kappa values of A2S2K, U-Net and HyperUnet all exceed 90%, but it remains difficult to improve the classification accuracy for some categories. As a two-branch multi-scale network, ASSMN has more stable classification results. The proposed method is superior: it has the best OA, AA and Kappa values and achieves the best accuracy for some specific categories, such as Class 4 Self-Blocking Bricks and Class 5 Bitumen, which further demonstrates its validity in terrain classification.
As shown in Fig. 6, the classification map of our method is smoother and more consistent.

C. Ablation Analysis
In this part, extensive ablation experiments are conducted on the two datasets to demonstrate the validity of the proposed GMR and SSCAF modules.
The validity analysis of GMR is shown in Table V. Without GMR, the OA, AA and Kappa values of the model are the lowest in the experiment, because some small-size samples are ignored during down-sampling. In contrast, using GMR modules in both branches extracts spectral and spatial features more effectively, which contributes to the final classification, and the resulting OA, AA, and Kappa are the best among the compared strategies. OA increases by 2.6% and 1.51% on the two datasets, respectively, which demonstrates the necessity of GMR. In addition, the OA of "Only Spe-GMR" is higher than that of "Only Spa-GMR", because the HSI contains enough spectral information for more useful feature information to be extracted from it.
The results of the SSCAF ablation experiments are shown in Table VI. Integrating SSCAF into the two branches of "With GMR" significantly improves network performance, which means that SSCAF lets the spatial and spectral information complement each other and contribute to the final classification decision. Compared with the model without SSCAF, OA is increased by 4.54% and 3.03%, AA by 3.14% and 3.59%, and Kappa by 5.32% and 3.46%, respectively, which demonstrates the necessity of SSCAF.
Both U-Net and HyperUnet have encoding and decoding paths, and the additional convolutional layers make their running time slightly longer than that of 2DCNN. SSRN and A2S2K use 3D convolution and introduce ResNet, which speeds up convergence and reduces training time. In ASSMN, the combination of dual-branch and multi-scale design, together with its spectral and spatial grouping strategy, makes the model more complex and requires longer training and testing time. The training and testing times of the proposed method are average among the compared methods. The attention mechanism used in the SSCAF module increases the complexity of the proposed network, so its training and testing times are not the shortest. However, the proposed method strikes a good balance between accuracy and efficiency.

IV. CONCLUSION
In this paper, we propose a dual-branch grouping multiscale residual embedded U-Net and cross-attention fusion network for hyperspectral image classification to improve the classification accuracy when training samples are sparse. The designed DGMRU module extracts multiscale context features and is suitable for the case of insufficient HSI samples. The designed GMR module increases the receptive field without adding parameters, and the feature extraction effect is better with the module than without it, which demonstrates its necessity. In addition, the proposed SSCAF maximizes the utilization of spatial-spectral features by constructing the intrinsic relationship between spatial and spectral features through cross-attention. Compared with other advanced algorithms, the proposed method has the best experimental results, with OA increasing by at least 0.8% and 0.47% on the two datasets, which shows that it is feasible and effective. In the future, we will consider further reducing the complexity of the network model and improving the computational efficiency while maintaining the classification accuracy.

Fig. 4. False color maps for the two datasets. (a) False color map of HC. (b) False color map of PC.

TABLE I. SAMPLE INFORMATION FOR EACH CLASS IN THE HC DATASET

TABLE II. SAMPLE INFORMATION FOR EACH CLASS IN THE PC DATASET

B. Analysis of Classification Results of Dataset
Classification Maps and Result of HC Dataset

TABLE III. CLASSIFICATION RESULTS OF THE HC DATASET

Classification Maps and Result of PC Dataset

TABLE IV. CLASSIFICATION RESULTS OF THE PC DATASET

TABLE VI
Discussion of Training Times and Testing Times
To measure the efficiency of the proposed method, this paper conducted comparative experiments on training and testing time, and the results are shown in Table VII. 2DCNN has the shortest training and testing times of all the methods, because its simple architecture has fewer trainable parameters, but its classification accuracy is relatively low.

TABLE VII. RUNNING TIME OF DIFFERENT METHODS ON TWO DATASETS