Automated Paper-based Multiple Choice Scoring Framework using Fast Object Detection Algorithm

Abstract—Optical mark reader (OMR) technology is an important research topic in artificial intelligence, with a wide range of applications such as text processing, document recognition, surveying, statistics, and process automation. Researchers have proposed many methods employing either traditional image processing and statistics or complex machine learning models. This paper presents a feasible solution to the OMR problem. It uses a fast object detection model to detect markers effectively and then segments the answer sheet into smaller regions so that the mark reader model can recognize the user's selections accurately. Experimental results on actual answer sheets from college exams show that the error rate is less than 0.5 percent, and the processing speed reaches up to 50 answer sheets per minute on a standard Core i5 personal computer.


I. INTRODUCTION
In the era of digitalization and automation, the education sector has attracted significant attention due to its potential to revolutionize traditional educational methods by incorporating cutting-edge technologies that improve the quality of education and academic management. In teaching, evaluating learning progress and assessing learners is very important. Automating this process with technologies such as the Optical Mark Reader (OMR) has attracted the attention of many researchers and organizations. OMR technology has become essential for automatic multiple-choice scoring systems, especially in large-scale competitions.
OMR is now widely used for exams and surveys with multiple-choice answers [1]. According to Zhang et al. [2], this is the most common type of exercise used in education. The technology focuses on rapidly extracting data from forms filled in with a pencil or pen. OMR supports Multiple Choice Questions (MCQs) in exams: it gives students quick results, serves as a tool that teachers and educational institutions can apply in their exams, reduces the need for manual labor, and improves performance. OMR initially appeared as dedicated hardware solutions [3,4,5,6] or paid services [7,8,9,10], and these approaches have been studied extensively. Later, software solutions [11,12,13,14,15] emerged along with the development of technology, gradually replacing specialized hardware devices. OMR approaches can be divided into two main categories: conventional image processing [4,16,17,18] and machine learning [11,19,20]. Conventional image processing approaches first adjust the orientation of the input image [5,11,20], then apply segmentation techniques to locate the areas that need identification [5,14]. After that, they detect whether an answer bubble is filled based on its grayscale level [21,22] or the number of dark pixels in the area [18,23,24]. These approaches are easy to build and have short implementation and run times. However, they may fail to fully capture the complex attributes and variations of each specific test, leading to low accuracy.
The limitations of conventional methods have led to growing interest in deep learning methods, especially convolutional neural networks (CNNs), which have demonstrated superior image processing and recognition capabilities. Deep learning offers potential for many research fields, such as image and signal processing [25,26]. It enables more accurate and robust OMR systems capable of handling diverse types of tests. In addition to the algorithms used in the pure image processing approach, this method builds a neural network suited to the problem. Classification techniques [13,19,23,27] are commonly used; they can accurately and quickly determine whether an answer box is selected.
In addition, the input images may come from many sources, such as cameras, webcams [2,28], or smartphones [16,19,20,29]; this factor also dramatically affects construction costs and model implementation time. Models that handle many image formats and sources save time and effort and reach more users.
Several methods deploy the system as software [4,12] on desktop or mobile devices and have built relatively complete systems. The benefit is that they can be used flexibly in many places and have high practical applicability. These systems often require users to print or create exam papers using predefined software [9]. However, they must ensure excellent and stable performance, since such systems are difficult to maintain, modify, and extend.
According to Sumit Tiwari and colleagues [30], tampering with OMR sheet data is common and affects exams today, yet this form of data tampering has not been taken into account by existing systems. Their work encodes the characteristics of the answers and the information students have marked on the answer sheet, then generates a QR code and uses it to evaluate whether an exam paper is fake, a novel method whose successful research results can be applied in practice. This article addresses the above research limitations by proposing a deep learning method based on the YOLO (You Only Look Once) algorithm to score multiple-choice tests accurately. We use YOLOv8 because it stands out as the fastest model, with fewer parameters than the other versions [31]. This study uses a data set of real-life multiple-choice tests and a training and testing process to create a powerful and effective model. The contributions of the research include flexible use of input images, low implementation costs, and high accuracy, yielding a fast, easy-to-use method with optimal use of resources. The rest of the article is organized as follows: Section II offers the proposed system architecture and detailed algorithm implementation; Section III presents the tests and result evaluations; Section IV concludes the article with future directions for the system.

II. PROPOSED SYSTEM ARCHITECTURE

A. Answer-sheet design
The answer sheets given to students in each exam are designed as shown in Fig. 1. The form was redesigned from the answer sheet used in Vietnam's national high school exam. It includes the student's registration number, exam code, and exam-class sections to capture information about students and exam questions, making statistics and data processing easier when the data set is large. In addition, the answer section holds a maximum of 60 questions, and specific questions can optionally be scored, which suits multiple-choice exams.

B. Overall System Implementation
Fig. 2 shows the general system diagram for the system analyzed in the previous section. This diagram can be used to understand how the system's components interact, and it provides an overview of the system's architecture and functionality.
We propose dividing the method into four phases: a segmentation and preprocessing phase, a labeling phase, a training phase, and an online recognition phase. These phases are presented in detail in the following sections.

C. Segmentation and Pre-processing Phases
To balance accuracy and performance before recognition, YOLO recommends that the model's input image be sized 640x640. The model resizes the input image so that its longest side is 640 pixels while maintaining the original aspect ratio. Because of this, if the image is not segmented into smaller parts, small details are lost when the input image is used for training. Therefore, we improved accuracy by focusing on the desired portion of the original input image, segmenting that portion while keeping the original resolution.
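To make the scale loss concrete, here is a small sketch (the function name is ours, not the paper's) of the aspect-preserving resize to a longest side of 640 pixels:

```python
def yolo_resize_dims(width: int, height: int, target: int = 640) -> tuple[int, int]:
    """Scale (width, height) so the longest side equals `target`,
    preserving the original aspect ratio, as YOLO does on input."""
    scale = target / max(width, height)
    return round(width * scale), round(height * scale)

# A 2480x3508 scan (A4 at 300 dpi) shrinks to 452x640: the answer
# bubbles become only a few pixels wide, which is why the sheet is
# segmented into regions at full resolution before recognition.
print(yolo_resize_dims(2480, 3508))  # (452, 640)
```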
We find the constant lines that surround the blocks and shapes in order to segment the input image into the student information section and the question-answer section, as shown in Fig. 3. Segmented image components are resized after cropping and prior to recognition; here, we resize so that the longest edge is 640 pixels while keeping the same proportions.
The training answer sheets are divided into two sets, training and validation, containing 85% and 15% of the data, respectively.
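The 85/15 split can be sketched as follows (function and file names are ours, for illustration only):

```python
import random

def split_sheets(paths: list[str], train_frac: float = 0.85, seed: int = 0):
    """Shuffle the labeled answer-sheet images and split them into
    training and validation sets (85% / 15% by default)."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = paths[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, val = split_sheets([f"sheet_{i:03d}.jpg" for i in range(100)])
print(len(train), len(val))  # 85 15
```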

D. Labeling phase
After the data has been preprocessed, we build a labeling process for the model. Labeling means defining bounding boxes around class instances in the image and assigning a caption to each box. By accurately locating the classes in the image, the model can be trained to detect and classify them in the subsequent training phase. Here, we label each cropped image with the LabelImg software.

1) Marker:
In reality, the input images can be skewed, rotated, etc. The coordinates of the markers that help us handle these problems are presented in the next part. We placed three markers of the same shape in three corners (top-left, top-right, and bottom-left of the exam paper) and labeled them "marker1". The marker in the bottom-right corner uses a different shape and is labeled "marker2", as shown in Fig. 4.
2) Question-Answer Section: In this section, each question has four answer options, and each question can have multiple correct answers, so there are 2^4 = 16 possible cases of selected answers. Therefore, we use 16 labels, encoded as 4-bit strings of 0s and 1s, shown in Table I and labeled as in Fig. 5.
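The 16-class bit encoding can be illustrated as follows (the function names are ours; the paper only specifies the 0/1 bit encoding in Table I):

```python
OPTIONS = "ABCD"

def encode_selection(selected: set[str]) -> str:
    """Encode a set of marked options as a 4-bit label, one bit per
    option in A..D order (1 = filled), giving 2**4 = 16 classes."""
    return "".join("1" if o in selected else "0" for o in OPTIONS)

def decode_label(bits: str) -> set[str]:
    """Inverse mapping: recover the set of marked options from a label."""
    return {o for o, b in zip(OPTIONS, bits) if b == "1"}

print(encode_selection({"A", "C"}))   # 1010
print(sorted(decode_label("0101")))   # ['B', 'D']
```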
3) Student Information Section: Student information includes the exam class code, student registration number, and exam code. These fields consist of the integer digits 0 to 9. Therefore, we use 10 labels, shown in Table II and labeled as in Fig. 6.

E. Training phase
During the training process, the performance and accuracy of the model are evaluated using parameters commonly applied in machine learning in general and object detection in particular: precision, recall, average precision (AP), and mean average precision (mAP) [32]:

Precision = TP / (TP + FP),  Recall = TP / (TP + FN)  (1)

In Formula 1:
• TP: number of cases correctly predicted as Positive.
• FP: number of cases predicted as Positive but actually Negative.
• FN: number of cases predicted as Negative but actually Positive.

From precision and recall, we calculate the average precision of the object detection model, and from there the mean average precision:

AP_k = Σ_i (Recalls(i) − Recalls(i+1)) × Precisions(i)  (2)

mAP = (1/h) Σ_{k=1..h} AP_k  (3)

In Formulas 2 and 3:
• h is the number of classes.
• Recalls(i) and Precisions(i) are the values of the i-th elements of the Recalls and Precisions arrays.
• AP_k is the AP value of the k-th class.
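A minimal sketch of these metrics (assuming, as in the common formulation, that the recall array is sorted in descending order with an implicit trailing 0):

```python
def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn) if tp + fn else 0.0

def average_precision(recalls: list[float], precisions: list[float]) -> float:
    """AP = sum_i (Recalls(i) - Recalls(i+1)) * Precisions(i),
    with recalls sorted descending and Recalls(n+1) taken as 0."""
    ap = 0.0
    for i in range(len(recalls)):
        next_r = recalls[i + 1] if i + 1 < len(recalls) else 0.0
        ap += (recalls[i] - next_r) * precisions[i]
    return ap

def mean_average_precision(aps: list[float]) -> float:
    """mAP = (1/h) * sum_k AP_k over the h classes."""
    return sum(aps) / len(aps)

print(precision(90, 10))                                # 0.9
print(average_precision([1.0, 0.5], [0.8, 1.0]))        # 0.9
```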

F. Online Recognition Phase
The input image is taken directly from a camera or smartphone, so it is impossible to avoid cases where the image is captured at different angles and distances from the answer sheet, or where it is blurry, incorrect, or misaligned. The purpose of this stage is to process the image so as to extract only the part that contains the multiple-choice answer sheet. The test paper must be aligned with the most appropriate orientation, brightness, and color before being fed to the recognition model.
First, we run prediction on the input image with the goal of identifying the four markers and obtaining their positions. After the YOLO model recognizes the four markers (three square markers and one circular marker), we have the coordinates of the four markers on the original exam paper. Note that the order of the detected corners is not guaranteed, so we need to rearrange the four corners into the correct order. Based on the Position Marker (PM) coordinates, we determine the orientation of the image by placing the three marker1 points at the top-left, top-right, and bottom-left positions and marker2 at the bottom-right position. Therefore, to retrieve the part of the image that contains only the answer sheet, we rotate the image so that marker2 is always at the bottom-right.
Call the top-left point P1, the top-right point P2, the bottom-right point P3, and the bottom-left point P4, where each point is defined by its coordinates (x, y). Rotating the input image by the angle α, we obtain a new image rotated in the correct direction. After the corner points have been identified and the target points have been calculated, we use the getPerspectiveTransform and warpPerspective functions to transform and align the input image. This image processing step produces a transformed image that contains only the answer sheet, correctly rotated.
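The corner-ordering step can be sketched in pure Python as follows. This is our illustration, not the paper's code; it assumes a portrait-oriented sheet, so the distances from the bottom-right marker P3 to the three marker1 points satisfy d(P4) < d(P2) < d(P1) (short edge, long edge, diagonal):

```python
import math

def order_corners(marker1_pts, marker2_pt):
    """Order detected marker centers as (P1, P2, P3, P4) =
    (top-left, top-right, bottom-right, bottom-left).
    P3 is the lone marker2; the marker1 points are ranked by their
    distance to P3: nearest = P4, middle = P2, farthest = P1."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    p3 = marker2_pt
    p4, p2, p1 = sorted(marker1_pts, key=lambda p: dist(p, p3))
    return p1, p2, p3, p4

def rotation_angle(p4, p3):
    """Angle (degrees) of the bottom edge P4 -> P3 relative to the
    horizontal; rotating the image by -angle levels the sheet."""
    return math.degrees(math.atan2(p3[1] - p4[1], p3[0] - p4[0]))

corners = order_corners([(0, 0), (100, 0), (0, 150)], (100, 150))
print(corners)                                 # ((0, 0), (100, 0), (100, 150), (0, 150))
print(rotation_angle(corners[3], corners[2]))  # 0.0
```

With the corners ordered, OpenCV's getPerspectiveTransform/warpPerspective (as named in the text) can then map them to the target rectangle.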
We obtain an image that includes only the answer sheet and has been transformed accurately. We tested our model on a variety of image samples, including different orientations and lighting conditions, and it still works well in most cases. The procedure is shown in Algorithm 1: it retrieves the background-removed image (image_extracted), applies getPerspectiveTransform and warpPerspective, and returns the preprocessed image; otherwise it terminates. Fig. 8 shows the final result of Algorithm 1. With this processed image, the subsequent segmentation and identification of each component is much easier.
After preprocessing, we crop the image to obtain the answer-column image and the student-information image, and then run recognition on them. Based on the recognition results, we extract the student information and the answers from each answer sheet. Subsequently, the answers are compared with the correct answers associated with each exam class code, allowing each candidate to be scored automatically. In addition, a threshold coefficient θ was introduced. This threshold determines the confidence level required to consider a prediction correct: if the confidence exceeds the threshold, the recognition result is accepted as accurate; otherwise, the prediction is considered wrong. The algorithm below describes the systematic identification and scoring process: if all detections pass the threshold, the result is inserted into the database and the mark is written; otherwise, a warning is issued.
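The scoring-with-threshold step can be sketched as follows. The function and field names are ours, and the 10-point scale is an assumption for illustration; only the threshold logic (accept above θ, warn below) comes from the text:

```python
THETA = 0.79  # confidence threshold (the value chosen experimentally in Section III)

def score_sheet(detections, answer_key, theta=THETA):
    """Score one sheet from (question, label, confidence) detections.
    If any detection falls below `theta`, the sheet is flagged for
    manual review (None) instead of being auto-scored; otherwise the
    marked labels are compared with the answer key."""
    if any(conf < theta for _, _, conf in detections):
        return None  # issue a warning / reject the sheet
    marks = {q: label for q, label, _ in detections}
    correct = sum(1 for q, key in answer_key.items() if marks.get(q) == key)
    return correct / len(answer_key) * 10  # assumed 10-point scale

dets = [(1, "1000", 0.95), (2, "0100", 0.88)]
key = {1: "1000", 2: "0010"}
print(score_sheet(dets, key))                     # 5.0
print(score_sheet([(1, "1000", 0.5)], {1: "1000"}))  # None
```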

III. EXPERIMENTAL RESULTS
A set of experiments was performed to evaluate the accuracy of the presented methods and the automated scoring system. With the algorithm written in Python, the automatic scoring system was evaluated on a laptop with an 11th-generation Intel Core i5 processor and 16 GB of RAM, recognizing input images comprising three parts: markers, student information, and answer information. The recognition time varies in proportion to the number of questions on the answer sheet: sheets with fewer answers take less time. We measured an average time of 1.2 seconds to recognize a 60-question answer sheet.
The model is used to predict results for new tests, and these results are passed to the scoring system to produce the final results shown in Fig. 9. After testing, refining, and continuing to train the model, our team achieved the results shown in Fig. 10. The confusion matrix shows the confusion between classes across the entire system; the confusion is very small, as indicated by the few points off the main diagonal (representing noise). The test set contains new tests that were not used to train the model. Based on Formula 3 and the results after training, the system's performance on a new test can be evaluated and the parameters calculated as described in Table IV. After completing the training phase, analyzing the input image, and applying the recognition process, rectangles are drawn around the points the model recognizes: markers, student information, and selected (filled) answers. The results obtained are shown in Fig. 12. For an accurate experiment, we used a data set whose input images come from different angles and resolutions, taken with many types of devices under diverse lighting intensities. In addition, the number of answers on each answer sheet varies, with multiple answer choices. To make the test data set diverse, accurate, and, most importantly, "realistic", we used this system to automatically score 300 real-life multiple-choice tests from the semester's final exam at Hanoi University of Science and Technology.
Choosing the threshold value is an integral part of evaluating model performance. The input image is likely wrong when one of the detected objects has a confidence level below the threshold θ that we have chosen in advance. We first performed a rough tuning experiment, choosing 0.6 ≤ θ ≤ 0.9 in increments of 0.05 (see Fig. 13). Based on those results, we performed a second, fine-tuning experiment (see Fig. 14), choosing 0.75 ≤ θ ≤ 0.85 in increments of 0.01, to select the most suitable θ value. Our system shows high accuracy when input images are of good quality, such as answer sheets captured flat under suitable, clear lighting conditions. However, the system struggles with lower-quality input images: when the answer sheets are blurry, uneven, or contain many extra lines due to uneven scanning, the model has difficulty identifying the answers and student information. In particular, when the input image lacks corners and markers are lost, the image preprocessing stage cannot proceed. The accuracy of the system can be significantly reduced if necessary information is lost or contaminated. Based on the analysis of Fig. 14, we chose θ = 0.79, a value large enough to give the lowest probability of errors on the data set.
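The two-pass tuning can be sketched as below. The `error_rate` callable and the example error curve are hypothetical stand-ins for the paper's validation-set measurements (Figs. 13 and 14); only the candidate grids come from the text:

```python
def tune_threshold(candidates, error_rate):
    """Return the theta in `candidates` with the lowest error rate.
    `error_rate` is a callable that would be evaluated on a labeled
    validation set (not reproduced here)."""
    return min(candidates, key=error_rate)

coarse = [round(0.60 + 0.05 * i, 2) for i in range(7)]   # 0.60, 0.65, ..., 0.90
fine   = [round(0.75 + 0.01 * i, 2) for i in range(11)]  # 0.75, 0.76, ..., 0.85

# Hypothetical error curve with its minimum near the paper's theta = 0.79:
err = lambda t: abs(t - 0.79)
print(tune_threshold(coarse, err))  # 0.8
print(tune_threshold(fine, err))    # 0.79
```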

IV. CONCLUSION
This paper presented an automated paper-based multiple-choice grading system utilizing a fast object detection algorithm. Research and experimental results on actual college exams have shown that employing the YOLOv8 model together with preprocessing techniques improves the performance of the OMR system, with an error rate below 0.5% and a processing time of around 1 second per sheet on a standard personal computer. This can help educational and assessment organizations perform test administration tasks more effectively. The article also highlights challenges and potential development directions: integrating the system into real-world applications and improving its real-time capability are challenges worth addressing in the future.

Fig. 7 illustrates the first image of the process. After obtaining the position of marker2 through identification, we take this position as the new bottom-right position and rotate the remaining corners around it; P′4 is the new bottom-left position. In fact, P′4 can lie in many places around P3, and we need to determine the rotation angle α wherever P′4 is. First, we determine the coordinates of P4 among the three marker1 coordinates. Because P4 is the bottom-left point, based on distance we take P4 to be the point with the shortest distance to P3, specifically d1 < d3 < d2:

P4 = {(xi, yi) ⊂ PM | d_P4 = min{d1, d2, d3}}

In the next step, we determine the coordinates of P′4 at the new bottom-left position, at a distance from P3 equal to that of P4, so the coordinates of P′4 are always:

x_P′4 = x_P1 − d1,  y_P′4 = y_P3  (5)
The mAP50 plot depicts the average precision; the model achieves a high value (approaching 1), showing high accuracy in object recognition. In the mAP50-95 plot, a high value shows that the model recognizes objects well across many different IoU threshold levels. By monitoring these graphs and continuously improving the model based on the insights gained, we can train the YOLOv8 model to accurately detect, locate, and classify the classes for multiple-choice exams.

TABLE I. ANSWER LABELS

Fig. 6. Labeling student information.

Metrics such as loss and mAP are monitored to assess the model. These parameters are explicitly described in Table III.

TABLE III .
CUSTOM TRAINING MODEL

TABLE II
After completing model training, the next step is to test and fine-tune the model. This step ensures that the model performs well on new tests and remains accurate across different types of questions.

TABLE IV .
MEAN AVERAGE PRECISION