DOI: 10.14569/IJACSA.2024.01510107

MH-LViT: Multi-path Hybrid Lightweight ViT Models with Enhancement Training

Author 1: Yating Li
Author 2: Wenwu He
Author 3: Shuli Xing
Author 4: Hengliang Zhu

International Journal of Advanced Computer Science and Applications (IJACSA), Volume 15, Issue 10, 2024.

  • Abstract and Keywords
  • How to Cite this Article
  • BibTeX Source

Abstract: Vision Transformers (ViTs) have become increasingly popular in various vision tasks. However, adapting them to applications where computation resources are very limited remains challenging. To this end, we propose a novel multi-path hybrid architecture and develop a series of lightweight ViT (MH-LViT) models that balance performance and complexity well. Specifically, a triple-path architecture is exploited to facilitate feature representation learning: image features are divided and shuffled along the channel dimension following a feature-scale balancing strategy. In the first path, ViTs are utilized to extract global features, while in the second path, CNNs are introduced to focus on local feature extraction. The third path completes the representation learning with a residual connection. Based on the developed lightweight models, a novel knowledge distillation framework, IntPNKD (Normalized Knowledge Distillation with Intermediate Layer Prediction Alignment), is proposed to enhance their representation ability, and an additional Mixup regularization term is introduced to further improve their generalization ability. Experimental results on benchmark datasets show that, with the multi-path architecture, the developed lightweight models perform well using existing CNN and ViT components, and with the proposed enhancement training methods, the resultant models notably outperform their competitors. For example, on miniImageNet, our MH-LViT M3 improves top-1 accuracy by 4.43% and runs 4x faster on GPU compared with EdgeViT-S; on CIFAR-10, our MH-LViT M1 improves top-1 accuracy by 1.24% and the enhanced version MH-LViT M1* by 2.28%, compared to the recent model EfficientViT M1.
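
To make the triple-path idea concrete, the following is a minimal PyTorch-style sketch of one such block, assuming an even 1:1:1 channel split, a single self-attention layer for the global path, a depthwise-separable convolution for the local path, and an identity (residual) third path, recombined by a channel shuffle. The layer choices, the split ratio, and the names TriplePathBlock and channel_shuffle are illustrative assumptions, not the authors' exact design.

import torch
import torch.nn as nn


def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    # Interleave channels across groups (as popularized by ShuffleNet) so the
    # three paths exchange information when blocks are stacked.
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(b, c, h, w)


class TriplePathBlock(nn.Module):
    # Hypothetical triple-path block: ViT path + CNN path + residual path.
    def __init__(self, channels: int, num_heads: int = 2):
        super().__init__()
        assert channels % 3 == 0, "channels must split evenly into three paths"
        c = channels // 3
        # Path 1: lightweight self-attention over spatial tokens (global features).
        self.attn_norm = nn.LayerNorm(c)
        self.attn = nn.MultiheadAttention(c, num_heads, batch_first=True)
        # Path 2: depthwise + pointwise convolution (local features).
        self.conv = nn.Sequential(
            nn.Conv2d(c, c, kernel_size=3, padding=1, groups=c),
            nn.BatchNorm2d(c),
            nn.GELU(),
            nn.Conv2d(c, c, kernel_size=1),
        )
        # Path 3: identity (residual) -- no parameters.

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2, x3 = torch.chunk(x, 3, dim=1)                   # split channels into three paths
        b, c, h, w = x1.shape
        tokens = self.attn_norm(x1.flatten(2).transpose(1, 2))  # (B, HW, C) tokens
        attn_out, _ = self.attn(tokens, tokens, tokens)
        g = attn_out.transpose(1, 2).view(b, c, h, w)            # global-feature path
        loc = self.conv(x2)                                      # local-feature path
        out = torch.cat([g, loc, x3], dim=1)                     # x3 is the residual path
        return channel_shuffle(out, groups=3)                    # mix the three paths


if __name__ == "__main__":
    block = TriplePathBlock(channels=48)
    y = block(torch.randn(2, 48, 14, 14))
    print(y.shape)  # torch.Size([2, 48, 14, 14])

The channel shuffle at the end is what lets the three channel groups trade information from block to block; without it, each group would only ever see its own path. The enhancement-training side can be sketched in the same spirit: the snippet below shows one plausible reading of a "normalized" distillation loss (per-sample logit standardization before temperature softening) together with standard Mixup. IntPNKD's intermediate-layer prediction alignment and its exact normalization are not specified in the abstract and are not reproduced here.

import torch
import torch.nn.functional as F


def normalized_kd_loss(student_logits, teacher_logits, T: float = 4.0):
    # Standardize logits per sample before softening -- an assumed reading of
    # "normalized" knowledge distillation; the paper's definition may differ.
    s = (student_logits - student_logits.mean(dim=1, keepdim=True)) \
        / (student_logits.std(dim=1, keepdim=True) + 1e-6)
    t = (teacher_logits - teacher_logits.mean(dim=1, keepdim=True)) \
        / (teacher_logits.std(dim=1, keepdim=True) + 1e-6)
    return F.kl_div(F.log_softmax(s / T, dim=1),
                    F.softmax(t / T, dim=1),
                    reduction="batchmean") * T * T


def mixup(x, y, alpha: float = 0.2):
    # Standard Mixup: convex combination of inputs; return both label sets.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[idx], y, y[idx], lam

# Example combination (weights are arbitrary placeholders):
#   loss = lam * F.cross_entropy(s_logits, y_a) + (1 - lam) * F.cross_entropy(s_logits, y_b) \
#          + 1.0 * normalized_kd_loss(s_logits, t_logits)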

Keywords: Multi-path hybrid; lightweight ViT; normalized knowledge distillation; Mixup regularization

Yating Li, Wenwu He, Shuli Xing and Hengliang Zhu, “MH-LViT: Multi-path Hybrid Lightweight ViT Models with Enhancement Training,” International Journal of Advanced Computer Science and Applications (IJACSA), 15(10), 2024. http://dx.doi.org/10.14569/IJACSA.2024.01510107

@article{Li2024,
  title     = {MH-LViT: Multi-path Hybrid Lightweight ViT Models with Enhancement Training},
  journal   = {International Journal of Advanced Computer Science and Applications},
  doi       = {10.14569/IJACSA.2024.01510107},
  url       = {http://dx.doi.org/10.14569/IJACSA.2024.01510107},
  year      = {2024},
  publisher = {The Science and Information Organization},
  volume    = {15},
  number    = {10},
  author    = {Yating Li and Wenwu He and Shuli Xing and Hengliang Zhu}
}



Copyright Statement: This is an open access article licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, even commercially, as long as the original work is properly cited.
