Monitoring Student Attendance Through Vision Transformer-based Iris Recognition

In the context of the ongoing digital transformation, the effective monitoring of student attendance holds paramount significance for educational establishments. This study presents an innovative approach using Vision Transformer technology for iris recognition to automate student attendance tracking. We fine-tuned Vision Transformer models, specifically ViT-B16, ViT-B32, ViT-L16, and ViT-L32, using the CASIA-Iris-Syn dataset and focused on overcoming challenges related to intra-class variation through data augmentation techniques, including rotation, shearing, and brightness adjustments. The results reveal that ViT-L16 is the most proficient, achieving an accuracy of 95.69%. Comparative analysis with prior methodologies, specifically those employing Vision Transformer with Convolutional Neural Network, underscores the superiority of our proposed ViT-L16 model. This superiority is evident across various metrics, including accuracy, precision, recall, and F1 score. The experimental setup involves the use of Jupyter Notebook, Python technologies, TensorFlow, and Keras, emphasizing evaluations based on loss, accuracy, and Confusion Matrix. ViT-L16 consistently outperforms the other models, showcasing its resilience in iris recognition for student attendance. This research marks a significant step towards modernizing attendance systems, offering an accurate and automated solution suitable for the evolving needs of educational settings. Future work could explore integrating additional biometric modalities and refining the Vision Transformer architecture for enhanced performance and broader application in educational environments.


I. INTRODUCTION
A Student Attendance System is a digital solution designed to track and manage the attendance of students in educational institutions. It offers an efficient and accurate way to record and monitor student attendance, replacing traditional manual methods. Additionally, applying a Student Attendance System significantly enhances organization, precision, and transparency within educational institutions. It can also serve as a tool for analyzing attendance patterns, identifying areas for improvement, and fostering communication between educators and parents.
Accurate and timely attendance records are fundamental for identifying student absenteeism, serving as a crucial component in promoting student retention and academic achievement [1]. By meticulously tracking attendance, educators can readily implement necessary interventions for at-risk students, facilitating their academic success.
Many educational institutions still employ manual student attendance tracking, such as roll call or sign-in sheets. These methods are inefficient, as they are time-consuming and susceptible to human error. Additionally, they lack real-time capabilities, hindering the timely identification and resolution of attendance issues. Furthermore, manual methods offer limited security and privacy compared to biometric solutions. Vision Transformer (ViT) technology seeks to address these shortcomings by automating attendance, potentially reducing educator workload [2]. Applied to iris recognition, a biometric approach, ViTs offer greater accuracy, security, and real-time monitoring than manual methods. This transformation can streamline administrative processes and align with the ongoing integration of technology within the educational sector.
Traditional attendance tracking solutions often rely on manual methods (roll calls, sign-in sheets), card-based systems (RFID [3] [4]), or biometric technologies [5] [6]. While these methods offer varying degrees of utility, they frequently encounter limitations in efficiency, security, and accuracy. The ViT-based approach represents a technological evolution in attendance management, leveraging sophisticated image processing and machine learning for iris recognition. This translates to superior security and precision, minimizing the potential for fraudulent activity. Furthermore, process automation makes the ViT approach an efficient and dependable solution within educational and organizational settings.
Iris recognition technology is being integrated into student attendance systems to automate and enhance monitoring and documentation processes. This biometric approach involves capturing and analyzing the unique patterns within the iris (the colored ring surrounding the pupil) [7]. Extracted features, including crypts, furrows, and freckles, serve as the basis for generating unique biometric templates for each individual.
Integrating AI, especially Vision Transformer technology, demonstrates a commitment to technological advancement within the educational institution. It positions the institution at the forefront of leveraging innovative solutions for routine tasks.
Addressing this challenge, we propose an innovative method for handling student attendance in educational institutions through the application of Computer Vision. Our strategy involves the detection and recognition of students' irises in classrooms utilizing a ViT. The primary focus of this paper is the creation of a transformer model designed specifically for the identification and recognition of iris images.
www.ijacsa.thesai.org
The specific tasks were carried out according to the following steps:
• Fine-tuning various Vision Transformer models to evaluate their performance in iris image classification.
• Utilizing a dataset from CASIA-Iris-Syn to assess the effectiveness of the proposed method.
• Evaluating the performance and accuracy of different Vision Transformer models, including ViT-B16, ViT-B32, ViT-L16, and ViT-L32, for the identification and classification of iris images.
• The results achieved demonstrated high performance, with an accuracy rate of 95.69% for iris image classification.
The subsequent sections of this paper are organized as follows: Section 2 reviews existing studies related to iris image recognition in attendance systems and investigates ViT applications in image processing. Section 3 outlines the materials and methods utilized in the experimental approach. Section 4 examines the results obtained and conducts a performance evaluation. Section 5 provides a comparative analysis of the proposed models. Lastly, Section 6 presents the conclusions derived from this study.

II. RELATED WORK
Various studies have explored different methods for monitoring attendance. Okokpujie et al. [8] implemented a Student Attendance System that utilizes Iris Biometric Recognition. The experimental findings indicate that the system operates through a web-based platform. Student identification is achieved by comparing the acquired iris image with the database entries. The system assigns an integer value of 1 for a successful match and 0 for no match, and these outcomes are then stored in a MySQL database. Shaban et al. [9] proposed a multimodal system utilizing ear and iris biometrics at the feature fusion stage to recognize students in electronic examinations (E-exams) amid the COVID-19 pandemic. The approach attained a precision rate of 92.6%.
Hassan et al. [10] devised a technique for iris segmentation comprising two stages. Initially, it identifies the outer iris boundary, followed by the detection of the inner iris boundary in the second stage. The method underwent testing on CASIA iris image datasets V1 and V4, yielding accuracy results of 100% and 99.16%, respectively.
Trabelsi & Shuaib [11] proposed a biometric attendance system using fingerprint and iris recognition to improve accuracy and security in educational settings. This system addresses limitations of manual methods by offering reliable and efficient student identification, enhancing overall attendance recording processes. Similarly, Adamu [12] introduced an advanced system integrating fingerprint and iris biometrics for attendance management in higher education.
This system replaces traditional methods with a secure, efficient, and accurate approach. Utilizing fingerprint and iris scanners at lecture entrances, it verifies student identities against stored biometric data, enabling real-time tracking and reporting.
Kadry & Smaili [13] implemented a wireless attendance management system incorporating Daugman's algorithm (Daugman, 2003) for iris recognition. This biometrics-based system, integrated with wireless technology, addresses issues related to inaccurate attendance records and avoids the challenges associated with establishing a dedicated network for this purpose. Khatun et al. [14] introduced the Iris Recognition Attendance Management System, which employs a camera to capture real-time images of the human iris and stores this data in a database. The system utilizes the Gray-coding algorithm in MATLAB data analysis software to compute the iris radius. Employing MATLAB, it compares the radius of each individual with the previously stored value and automatically sends the attendance report to a predefined email address, eliminating the need for human intervention. Sujatha et al. [15] proposed a solution for a biometric-based attendance system utilizing iris recognition, interfaced with NI MYRIO. The proposal emphasizes the robustness of iris recognition, highlighting its reliability, accuracy, and efficiency attributed to the unique and immutable characteristics of the iris. Furthermore, NI LABVIEW, a graphical user interface-based software, facilitates real-time monitoring and attendance management. The integration of features such as SMS notifications for absentees and the generation of Excel sheets enhances the overall functionality of the system.
Joshy & Jalaja [16] introduced a biometric authentication system based on the Internet of Things (IoT), emphasizing the use of iris recognition for its accuracy and security. The proposed system incorporates a hybrid encryption algorithm (Blowfish and RSA) for securing data transmitted over the Internet and implements a two-step authentication process. It was developed as an embedded system for secure employee authentication.
Lad & More [17] developed a student attendance system leveraging iris detection technology, which they describe as the most reliable and accurate form of biometric identification. This initiative aims to address the shortcomings of commercial systems by offering an open-source alternative. The system employs the Hough transform for automatic iris segmentation, normalizes the iris region, and uses 1D Log-Gabor filters for feature extraction. These steps are designed to enhance the efficiency and accuracy of attendance tracking in educational contexts.
In [18], the authors presented a multimodal biometric system utilizing Convolutional Neural Networks (CNN) and transfer learning for iris recognition. It aims to overcome limitations in unimodal biometric methods by focusing on deep learning models for analyzing both left and right irises. Employing back-propagation with the Adam optimizer, the system demonstrates high accuracy on the public datasets IITD and CASIA-Iris-V3 Interval, achieving up to 99% accuracy. This study underscores the effectiveness of combining CNN characteristics and transfer learning in real-time iris recognition, enhancing security and identification processes in various conditions.
In recent studies, the Vision Transformer has been employed for image classification and identification [19]. The ViT is a neural network architecture tailored specifically for image processing in computer vision applications. It employs a self-attention mechanism commonly found in natural language processing, setting it apart from traditional architectures such as CNNs and RNNs. Introduced to address limitations in handling image data, ViT offers robust image feature representation and is reported to require fewer computational resources for training compared to CNNs [20].
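Transformer-based models have also been applied to food image classification by Elpina & Kusuma [21], to ear recognition by Mehta et al. [22], and to iris identification by Latif et al. [23]. Ha et al. [24] employed the Vision Transformer architecture to extract data features and categorize X-ray images as either pneumonia-positive or negative. Experimental findings reveal that the Vision Transformer algorithm consistently yields favorable classification outcomes, achieving an accuracy of approximately 94%.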

III. MATERIALS AND METHODS
A. Proposed Methodologies
This paper investigates the enhancement of student attendance management through a novel ViT-based iris recognition approach. This approach leverages automation to improve the accuracy and efficiency of attendance tracking.

B. Dataset
Our study utilized a dataset sourced from CASIA-Iris-Syn, featuring 8533 artificially generated iris images distributed across 50 classes, as illustrated in Fig. 2. The textures of these iris images were automatically synthesized from a subset of CASIA-IrisV1. Subsequently, the iris ring regions were incorporated into authentic iris images, augmenting the realism of the artificial iris images. Intra-class variations, including deformation, blurring, and rotation, were introduced into the synthesized iris dataset. The training dataset comprises 5814 iris images in JPG format. The validation and test sets consist of separate images, 1027 and 1692 respectively. A graphical illustration of the configuration of this dataset is depicted in Fig. 3. The inspiration for this illustration is derived from [2].
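As a quick consistency check on the split sizes quoted above, the following sketch recomputes the total and the share of each subset. The dictionary layout and variable names are illustrative only and are not taken from the authors' code.

```python
# Split sizes as stated in the text; names are illustrative.
splits = {"train": 5814, "validation": 1027, "test": 1692}
num_classes = 50

total = sum(splits.values())  # should match the 8533 images reported
fractions = {name: count / total for name, count in splits.items()}

# Average number of test images per class, assuming a roughly even spread.
test_per_class = splits["test"] / num_classes

for name, fraction in fractions.items():
    print(f"{name}: {splits[name]} images ({fraction:.1%})")
print(f"total: {total}, test images per class: {test_per_class:.2f}")
```

Under the assumption of a roughly even class distribution, the test set holds about 33.84 images per class, which agrees with the per-class support of 33 to 34 reported later for the classification results.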

C. Data Augmentation Technique
Data augmentation is a pivotal technique in machine learning, aimed at artificially expanding a training dataset by applying diverse transformations to the existing data. This strategy enhances the generalization and resilience of machine learning models. Its significance becomes more pronounced when the training dataset is small. By creating novel variations of pre-existing data, data augmentation both mitigates overfitting risks and enables the model to capture more resilient features. Widely utilized across domains like image classification, object detection, and segmentation, data augmentation is a fundamental tool for bolstering the performance and adaptability of machine learning models.
In this research, we applied several data augmentation methods. The precise parameters selected for each operation are detailed in Table I.
Table I outlines the specific parameters employed for the data augmentation operations. Rotation is set at 30 degrees, shearing at 0.2 radians, and zooming within a range of 0.2. Horizontal and vertical flips are enabled, and brightness is varied within the range of 0.4 to 1.5. These parameters expand the variety of the training dataset, ultimately bolstering the model's robustness and performance.
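The Table I settings can be gathered into a single configuration, as in the minimal sketch below. The key names mirror the argument names of Keras-style image generators, which is an assumption because the paper does not show its augmentation code. The helper `flip_and_brighten` is hypothetical and applies only the flip and brightness operations; rotation, shearing, and zooming need interpolation and are left to an image library.

```python
import numpy as np

# Table I values. The key names are an assumption modeled on Keras-style
# generator arguments; the paper does not list its actual code.
AUGMENTATION = {
    "rotation_range": 30,             # degrees
    "shear_range": 0.2,
    "zoom_range": 0.2,
    "horizontal_flip": True,
    "vertical_flip": True,
    "brightness_range": (0.4, 1.5),
}

def flip_and_brighten(image, rng, params=AUGMENTATION):
    """Randomly flip an HxWxC uint8 image and rescale its brightness."""
    out = image.astype(np.float32)
    if params["horizontal_flip"] and rng.random() < 0.5:
        out = out[:, ::-1, :]         # mirror left-right
    if params["vertical_flip"] and rng.random() < 0.5:
        out = out[::-1, :, :]         # mirror top-bottom
    low, high = params["brightness_range"]
    out = out * rng.uniform(low, high)  # one brightness factor per image
    return np.clip(out, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
sample = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)  # stand-in image
augmented = flip_and_brighten(sample, rng)
```

Each call draws fresh random choices, so repeated passes over the same training image yield different variants while the image size and data type stay unchanged.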

D. Vision Transformer (ViT)
The Vision Transformer is a deep learning architecture designed to tackle computer vision tasks, challenging the conventional prominence of convolutional neural networks. Originating from the paper titled "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" by Alexey Dosovitskiy et al. [2], ViT extends the transformer architecture, initially crafted for natural language processing, into the realm of images. This adaptation relies on self-attention mechanisms, enabling the model to capture long-range dependencies within the input data. ViT's introduction opened up new possibilities for image recognition and paved the way for diverse applications beyond traditional CNN-based approaches.
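For reference, the self-attention operation mentioned above is commonly written as scaled dot-product attention. The symbols below follow the standard transformer formulation and are not defined in the original text: Q, K, and V are the query, key, and value matrices computed from the input sequence, and d_k is the key dimension.

```latex
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]
```

Because every token attends to every other token, a patch in one corner of the iris image can directly influence the representation of a patch in the opposite corner, which is the long-range dependency referred to above.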
In contrast to processing the entire image in a holistic manner, the Vision Transformer partitions the input image into fixed-size, non-overlapping patches. Each of these patches then undergoes a linear embedding, transforming it into a flat vector and composing the input sequence for the transformer. To preserve spatial information, positional embeddings are added to the patch embeddings. This addition enables the model to discern the spatial relationships between distinct patches, ensuring a nuanced understanding of the overall image structure. These mechanisms enhance ViT's capacity to process and interpret intricate spatial features within images.
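The patch pipeline described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the embedding width `embed_dim` is a hypothetical small value, the projection and position embeddings are random stand-ins for learned weights, and the class token is omitted.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an HxWxC image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)   # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3), dtype=np.float32)   # stand-in iris image
patches = patchify(image, 16)                         # shape (196, 768)

embed_dim = 64  # hypothetical width, chosen small for illustration
projection = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim)).astype(np.float32)
position_embeddings = rng.normal(scale=0.02, size=(patches.shape[0], embed_dim)).astype(np.float32)

# Linear embedding of each patch, plus one position embedding per patch.
tokens = patches @ projection + position_embeddings
```

The resulting `tokens` array has one row per patch and forms the sequence that a transformer encoder would consume.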
ViT models are typically pre-trained on large datasets, such as ImageNet, using supervised learning. This pre-training helps the model learn rich visual representations. The pre-trained ViT model is then fine-tuned for a specific task by adding a linear classification head on top. The model can be fine-tuned for various computer vision tasks such as image classification, object detection, and segmentation.
ViT has shown good scalability: when pre-trained on large datasets, it transfers well to smaller downstream datasets. This scalability is advantageous for adapting the model to different tasks.
In this research, the Vision Transformer architecture is configured with adjustable dimensions to suit specific requirements. Each parameter plays a distinct role, as outlined below:
• image_size=224: This parameter defines the dimensions (width and height) of the input images for the model. In this instance, the images are expected to have dimensions of 224x224 pixels.
• patch_size=16: The images are split into smaller patches, and this parameter determines the size (width and height) of each patch. In this case, each patch measures 16x16 pixels.
• num_classes=50: This parameter specifies the number of classes involved in the classification task. Here, the model is configured to categorize inputs into 50 classes.
• dropout=0.2: This parameter governs the dropout rate, a regularization technique employed to mitigate overfitting. It involves randomly setting a fraction of input units to 0 during training.
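To make the effect of these settings concrete, the sketch below computes the length of the token sequence the transformer receives. The function name is illustrative, the extra class token follows the standard ViT design rather than anything stated in the text, and the patch size of 32 for the B32 and L32 variants is inferred from their names.

```python
def vit_sequence_length(image_size, patch_size, class_token=True):
    """Number of tokens fed to the encoder for a square input image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side + (1 if class_token else 0)

# Configuration as listed in the text.
config = {"image_size": 224, "patch_size": 16, "num_classes": 50, "dropout": 0.2}

tokens_p16 = vit_sequence_length(config["image_size"], config["patch_size"])  # 14*14 + 1 = 197
tokens_p32 = vit_sequence_length(config["image_size"], 32)                    # 7*7 + 1 = 50
```

A 16x16 patch size yields roughly four times as many tokens as 32x32, so the B16 and L16 variants see the iris texture at a finer granularity at a higher computational cost.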

E. Evaluation Metrics
The assessment of prediction algorithms in this study relies on various performance metrics. The paper examines the following evaluation metrics to gauge the efficacy of the proposed model, where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives.

1) Accuracy score:
The accuracy score is a performance metric used to measure the overall correctness of a predictive model. It is calculated by dividing the number of correct predictions by the total number of predictions and is often expressed as a percentage [25]. The formula for accuracy is shown in equation (1).
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

2) Precision:
Precision is a performance metric used in classification tasks to assess the accuracy of the positive predictions made by a model. It is defined as the ratio of true positive predictions to the total number of positive predictions (both true positives and false positives) [25]. The formula for precision is shown in equation (2).
$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$

3) Recall:
Recall, also known as sensitivity or true positive rate, is a performance metric used in classification tasks to evaluate a model's ability to correctly identify all relevant instances of a particular class. It is the ratio of true positive predictions to the total number of actual positive instances (including both true positives and false negatives) [25]. The formula for recall is shown in equation (3).
$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$

4) F1-score:
The F1 score is a metric commonly used in classification tasks that combines both precision and recall into a single measure. It is particularly useful when there is an uneven class distribution (imbalanced datasets) and provides a balance between the precision and recall metrics [25]. The formula for the F1 score is shown in equation (4).
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$

5) Matthews correlation coefficient:
The Matthews correlation coefficient (MCC) is a metric used to evaluate the performance of binary classification models, particularly when dealing with imbalanced datasets. It takes into account true positives, true negatives, false positives, and false negatives. The formula for the Matthews correlation coefficient is shown in equation (5).
$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{5}$$
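The five metrics defined in this section can be computed directly from the four confusion counts, as in the following sketch. The function name and the example counts are hypothetical and serve only to illustrate the definitions. For the 50-class task, per-class values would be obtained one class at a time (one-vs-rest) and then averaged.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, F1, and MCC from binary confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

# Hypothetical counts for a single class treated as "positive".
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The zero-denominator guards return 0.0 by convention when a class is never predicted or never present, which avoids division errors for rare classes.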

IV. RESULTS AND DISCUSSIONS
The primary objective of this research is to develop a transformer model designed for the identification and recognition of iris images. The model underwent training using both regular and augmented images, with data augmentation employed to enhance the training dataset. Training involved 5814 images, validation 1027 images, and testing 1692 images. The network layers were subsequently frozen and fine-tuned with dense layers containing 1024, 512, 256, and 50 neurons, respectively.
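To give a sense of how many weights are actually trained once the backbone is frozen, the sketch below counts the parameters of the dense head described above. The backbone feature widths (768 for ViT-B, 1024 for ViT-L) are the standard hidden sizes of those variants and are an assumption here, as is the presence of bias terms; the paper does not state either.

```python
HEAD_LAYERS = (1024, 512, 256, 50)  # dense layer sizes from the text

def head_parameter_count(feature_dim, layers=HEAD_LAYERS):
    """Weights plus biases of a stack of fully connected layers."""
    count = 0
    previous = feature_dim
    for units in layers:
        count += previous * units + units  # weight matrix + bias vector
        previous = units
    return count

# Assumed backbone output widths: 768 (ViT-B) and 1024 (ViT-L).
trainable_vit_b = head_parameter_count(768)
trainable_vit_l = head_parameter_count(1024)
print(trainable_vit_b, trainable_vit_l)
```

Under these assumptions the head holds roughly 1.5 to 1.7 million trainable parameters, a small fraction of the frozen backbone, which keeps fine-tuning on 5814 training images tractable.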

Table II summarizes the performance metrics of the different ViT models, each trained for 50 epochs, in the context of iris image identification and recognition. Notably, the ViT-L16 model emerges as the top performer with an accuracy score of 95.69%, demonstrating its effectiveness in accurately classifying iris images. ViT-L32, ViT-B32, and ViT-B16 also exhibit commendable performance, achieving accuracy scores of 94.30%, 93.26%, and 92.32%, respectively. In terms of precision, recall, F1 score, and Matthews Correlation Coefficient (MCC), ViT-L16 consistently outperforms the other models, emphasizing its robustness across evaluation criteria. These results indicate that the ViT-L16 model, with its customizable dimensions and advanced architecture, is particularly effective in iris recognition tasks, demonstrating its potential for applications such as student attendance using Vision Transformer technology.

A. Experimental Setup
The experimental setup utilized Jupyter Notebook along with Python libraries such as NumPy, Pandas, and OpenCV for image processing tasks. For implementing classifiers, Scikit-Learn, Anaconda, and Python 3.9 were employed. The Vision Transformer model underwent training and testing using TensorFlow and Keras on Google Colab Pro with a T4 GPU, with reported memory of 51 GB and storage space of 166.77 GB.

2) Accuracy: serves as a metric for the overall correctness of the model, determining the ratio of correctly predicted instances to the total instances. The goal in both training and testing phases is to maximize accuracy, as a higher value signifies a greater proportion of correct predictions.
In Fig. 4, the evaluation of loss and accuracy is depicted for the ViT-B16, ViT-B32, ViT-L16, and ViT-L32 models. The results indicate that the ViT-L16 model exhibits superior performance compared to the other models.

C. Confusion Matrix
An additional evaluation tool, the Confusion Matrix, was utilized to assess the overall effectiveness of each classification model. The Confusion Matrix serves as a tabular summary, offering a detailed breakdown of the model's predictions in comparison to the actual class labels. The evaluation outcomes for the mentioned models, using these criteria, are illustrated in Fig. 5.

The research team faced challenges during the fine-tuning process due to intra-class variations (deformation, blurring, and rotation) within the synthesized iris dataset. To address these limitations, the data augmentation techniques listed in Table I, such as rotation, shearing, and brightness adjustment, were employed. These augmentations enhanced the robustness of the Vision Transformer models, particularly the ViT-L16, by introducing artificial variations within the training data. This improved model performance on iris patterns with diverse characteristics, leading to higher accuracy and reliability in iris recognition tasks. These findings demonstrate the effectiveness of data augmentation in mitigating the effects of intra-class variations and highlight the adaptability of Vision Transformer architectures for tasks like student attendance monitoring using iris recognition.

This research explored the implementation of various Vision Transformer models (ViT-B16, ViT-B32, ViT-L16, and ViT-L32) for iris-based student attendance tracking. The ViT-L16 model demonstrated the highest performance in terms of accuracy, precision, and recall. Additionally, the study confirmed the adaptability of Vision Transformer architectures for iris recognition, underscoring the importance of data augmentation in improving model robustness.

V. COMPARATIVE ANALYSIS
Table III presents a comparison of the outcomes of our approach, employing Vision Transformer models ViT-L16 and ViT-L32, with findings from a preceding investigation utilizing ViT with CNN. The assessment is based on accuracy, precision, recall, and F1 score, observed over 50 epochs.
In the study by Latif et al. [23], the ViT+CNN model attained an accuracy of 93.66% after 50 epochs, although figures for precision, recall, and F1 score were not reported. Our methodology, leveraging ViT-L16, outperformed this result with an accuracy of 95.69%. Furthermore, precision, recall, and F1 score for ViT-L16 were 96.08%, 95.69%, and 95.64%, respectively. This indicates an improvement in our model's capacity to accurately discern and categorize instances.
The ViT-L16 model's strong performance likely stems from its architecture, which excels at processing global image features, a vital aspect of accurate iris recognition. Unlike hybrid ViT+CNN models, which may introduce redundancies or inefficiencies through convolutional layers, the ViT-L16 relies exclusively on self-attention mechanisms. This enables a more direct and focused learning process that emphasizes the most pertinent features without the limitations inherent in convolutional operations.
Comparative results across the ViT-L16, ViT-L32, and ViT+CNN models show a clear pattern: the pure transformer-based models (ViT-L16, ViT-L32) outperform the hybrid ViT+CNN model in accuracy, the only metric reported for the hybrid model. This finding suggests that self-attention mechanisms within transformers may be better suited for iris recognition in attendance systems than a hybrid approach. Furthermore, these results highlight the potential of pure transformer models for driving improvements in biometric recognition systems.
Elpina & Kusuma [21] introduced a Swin Transformer model for feature extraction in food image classification, incorporating an SVM classifier. The methodology underwent training and evaluation utilizing the Food-101 Dataset, resulting in an accuracy (ACC) of 97.61%. Mehta et al. [22] introduced a method for ear recognition that uses the ViT network architecture, attaining a recognition accuracy surpassing 99.36%. Latif et al. [23] introduced a hybrid model combining ViT and Convolutional Neural Network (CNN) for the identification and verification of iris images. The hybrid model demonstrated an accuracy of up to 93.66% in recognizing iris patterns.

Fig. 1 illustrates our proposed Vision Transformers approach for iris identification and recognition.

Fig. 1. Visualization of our proposed Vision Transformer (ViT) model for identifying and recognizing iris images. Initially, the input image undergoes segmentation into fixed-size patches, which are subsequently flattened. Following this, position embeddings are introduced, and the resulting sequence of vectors is then passed through a standard Transformer encoder. The inspiration for this illustration is derived from [2].

Fig. 5(a) presents the recognition outcomes achieved using the ViT-B16 model. The average count of accurate recognitions for each category was 31.16, with the expected correct recognition count ranging between 33 and 34. As a result, the average accuracy attained by the ViT-B16 model was 92.02%. Fig. 5(b) presents the ViT-B32 model's recognition results. The mean correct recognition count for each category was 31.56, with the expected correct recognition count ranging between 33 and 34. As a result, the average accuracy of the ViT-B32 model was 93.26%. Fig. 5(c) shows the recognition outcomes of the ViT-L16 model. The mean correct recognition count for each category was 32.44, with the expected correct recognition count ranging between 33 and 34. The ViT-L16 model achieved an average accuracy of 95.69%. Fig. 5(d) depicts the recognition results of the ViT-L32 model. The mean correct recognition count for each category was 31.82, and the expected correct recognition count ranged between 30 and 34. The average accuracy of the ViT-L32 model was 94.03%.

D. Classification Report
Fig. 6(a) displays the classification report for the ViT-B16 model, indicating precision values for iris classes ranging from 0.47 to 1. Additionally, the recall values for iris classes fall within the range of 0.35 to 1, with corresponding support values between 33 and 34. F1 scores for the iris classes vary from 0.52 to 1. The ViT-B16 model achieves an accuracy of 0.92 (92%) based on the F1 score, considering 1692 support values. The macro and weighted averages for precision and recall are 0.94, 0.92, 0.94, and 0.92, and the F1 scores are 0.92 and 0.92, each with support values of 1692. Fig. 6(b) exhibits the classification report of the ViT-B32 model, revealing precision values within the range of 0.73 to 1

TABLE II. ACCURACY SCORE, PRECISION SCORE, RECALL SCORE, F1 SCORE AND MCC OF OUR VISION TRANSFORMERS MODELS

1) Loss: serves as an indicator of the model's performance on training data, gauging the discrepancy between predicted values and actual ground truth. The training objective involves minimizing the loss, with a lower value indicating closer alignment between model predictions and actual values.

TABLE III. COMPARISON OF RESULTS WITH PREVIOUS WORKS