Pilot Study: The Use of Electroencephalogram to Measure Attentiveness towards Short Training Videos

Abstract: Universities, schools, and training centers are seeking to improve their computer-based [3] and distance learning classes through the addition of short training videos, often referred to as podcasts [4]. As distance learning and computer-based training become more popular, it is of great interest to measure whether students are attentive to recorded lessons and short training videos. The proposed research presents a novel approach to this issue. Signal processing of the electroencephalogram (EEG) has proven useful in measuring attentiveness in a variety of applications such as vehicle operation and listening to sonar [5]. Studies have also shown that EEG data can be correlated to the ability of participants to remember television commercials days after they have seen them [16]. Electrical engineering presents a possible solution with recent advances in the use of biometric signal analysis for the detection of affective (emotional) response [17]. Despite the wealth of literature on the use of EEG to determine attentiveness in a variety of applications, the use of EEG for the detection of attentiveness towards short training videos has not been studied, nor is there enough consistency in the methods reported to imply a single method for this new application. Indeed, there is great variety in the EEG signal processing and machine learning methods described in the literature cited above and in other literature [28] [29] [30] [31] [32] [33] [34]. This paper presents a novel method which uses EEG as an input to an automated system that measures a participant's attentiveness while watching a short training video. This paper provides the results of a pilot study, including a structured comparison of signal processing and machine learning methods to find optimal solutions which can be extended to other applications.


I. INTRODUCTION
For the purposes of this experiment, the affective state of the participant watching the short training video is deemed to be attentive when the participant has a high positive affect, including feelings of satisfaction, engagement, interest, and involvement. This definition of short training video attentiveness is not the only possible definition, but it is justified by related research in similar applications.
In [20] the authors propose a brief and easy-to-administer mood scale called the Positive and Negative Affect Schedule (PANAS), where the orthogonal axes of emotion are "negative affect" and "positive affect", and high positive affect is associated with terms such as high energy, full concentration, and pleasurable engagement.
In [21] affect is explored not from the point of view of the learner having an emotional response, but rather of a computerized tutoring system having an avatar that uses facial expressions and body movements to communicate an affective state to the learner. With regard to the range of emotions expressed by the avatar, the researchers explain "…because their role is primarily to facilitate positive learning experiences, only a critical subset of the full range of emotive expression is useful for pedagogical agents. For example, they should be able to exhibit body language that expresses joy and excitement when learners do well, inquisitiveness for uncertain situations (such as when rhetorical questions are posed), and disappointment when problem-solving progress is less than optimal."
In [22] the authors define a model of emotions and learning that explains the emotional state of the learner as being in one of four quadrants, depending on the positive or negative value on two axes. The first axis is "learning," and the second axis is "affect." The positive side of the affect axis is associated with terms such as awe, satisfaction, curiosity, and hopefulness.
In [32] the researchers use machine learning to predict a participant's self-assessed emotion when presented with the results of a quiz and personality test. The participant finds out the results, and then expresses their emotions by selecting one of the following words: disappointment, distress, joy, relief, satisfaction, or fear-confirmed.
In [26] researchers analyze the affective state of participants while they are performing a learning task with a computerized natural language tutor, using both online and offline self-reports by participants as well as peers and trained judges observing facial features. Words that the study used in both self-reports and reports by observers are defined in TABLE I.
It is useful to note that this study found that the inter-rater reliability coefficient, Kappa, of trained judges who watched the facial expressions of the participants and coded facial actions to provide affective judgments was only 0.36. Although Kappa is a conservative measure of agreement, this still shows that emotions are difficult to judge even for those who are trained to do so.
www.ijacsa.thesai.org
The mathematical definition of Kappa is shown in Equation (1), where Pr(a) is the relative observed agreement among raters, and Pr(e) is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly saying each category.

$$\kappa = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)} \qquad (1)$$

If the raters are in complete agreement then κ = 1. If there is no agreement among the raters other than what would be expected by chance, as defined by Pr(e), κ = 0.
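As an illustration of how Pr(a), Pr(e), and κ relate, the following minimal sketch computes Kappa for two raters. The function name `cohens_kappa` and the example labels are hypothetical and are not taken from [26]; the sketch assumes the two-rater (Cohen) form of the statistic.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Kappa for two raters who labeled the same items (two-rater form assumed)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Pr(a): relative observed agreement among the raters.
    pr_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Pr(e): chance agreement, from each rater's observed category frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    pr_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if pr_e == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (pr_a - pr_e) / (1.0 - pr_e)


# Hypothetical affect labels, not data from the study.
judge_1 = ["bored", "engaged", "engaged", "confused", "bored", "engaged"]
judge_2 = ["bored", "engaged", "confused", "confused", "engaged", "engaged"]
print(round(cohens_kappa(judge_1, judge_2), 3))
```

In this made-up example the judges agree on four of six items, yet Kappa is well below that raw agreement rate because part of the agreement is attributed to chance.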
In [23] the researchers attempt to dive deeper into the affective state of the learner by gathering, in the learners' own words, the definition of what they mean by "engagement in courses" and "What makes a course engaging to you?" The learners in this study were male and female freshmen and sophomores in engineering, and the researchers were able to group the responses into categories, as well as give the specific terms that the students used. Some notable popular categories and terms include: active participation, hands on, faculty enthusiasm and interest, discussion, and interaction between faculty and students, along with other similar terms and other less popular terms.
For the purposes of this pilot study, attentiveness is therefore defined as the participant having a high positive affect towards a short training video.

II. EXPERIMENTAL SETUP
Recent research has demonstrated the use of a single dry frontal electrode for the capture of EEG in similar applications. For this research experiment, the MindWave Mobile device from NeuroSky was used.
In [28], the same make and model of EEG data collection device showed a positive correlation between the EEG data collected and participant self-reported attention during an experiment in which the participant interacted with a computer-generated three-dimensional avatar. The same make and model of EEG data collection equipment was also used in [17], where the brain wave data accurately identified the participant's self-categorization of affective state 78% of the time during an experiment that presented increasingly stressful mental challenges to the participant. The same device was also used in [27] to predict, with accuracy significantly better than chance, whether the participant was reading an easy or a hard sentence. Finally, the presentation related to [36] indicated that the correlation analysis could just as effectively use only the frontal contact.
In [25], the participant's affective state while using an Intelligent Tutoring System (ITS) is measured using a single frontal lobe EEG electrode with earlobe grounds, showing accurate detection (82%) of affective state from the brainwaves. The work in [25] was based on earlier work [31], which used the same EEG sensor setup and revealed an ability to use brainwaves to accurately predict participant self-assessed emotion according to the Self-Assessment Manikin (SAM) scale. The study in [33] also uses only a frontal electrode (two electrodes are used, but the voltage difference between the two is used as the single EEG data stream over time) to achieve 85.7% accuracy in identifying whether the participant is watching an interesting video or an uninteresting one. In [35], the single electrode is placed at F3 or F4 alternately in an experiment that also collected facial expressions while participants watched emotion-inducing videos, revealing that noisy correlated EEG-facial expression data could be correctly recovered in as many as 95.2% of cases.

A. Watching Short Training Videos
Participants are fitted for the dry contact EEG, sitting in front of the data collection and video playback laptop as seen in Figure 1. The headset is adjusted until a strong signal is generated without interruption, and blinks can be seen on the EEG graph displayed in real time.
Once the setup is operational, the recording is started, and the video playback begins. The video starts with a relaxation segment that begins with a text message as in Figure 2. After 2.5 minutes of relaxing empty-beach photos with the sound of waves and seagulls in the background, the short training video plays for an additional 2.5 minutes. The training videos explain a hypothetical technical lesson regarding an imaginary device with component names such as "Alpha Module" and "Beta Module". The participant must learn how to troubleshoot the device and decide whether or not to replace a component. The videos include enhanced podcast (essentially a narrated PowerPoint slide show) as well as worked-example style training that asks the participant to act along with the still pictures and live video of the trainer. The participant is encouraged to act along with the instructor in the video. That acting may be to hold up their hand, or to grasp their thumb or their index and middle fingers. This is done as a memory aid because the fictional device purposely looks like the fingers of a hand. One scene is shown in Figure 3.

B. Data Collection
The EEG signal is collected at 512 Hz in 1-second frames which can be concatenated to any length. To test some of the procedures and mathematical algorithms prior to the proposed research study, a set of seven participants were asked to participate, and detailed notes were taken during this testing by an observer who also sat in the room. After the videos were done, the participant removed the EEG headset and answered a second questionnaire, including a set of closed-ended questions (yes or no) and a multiple-choice quiz relating to the lesson provided.
Once the EEG data is collected, the data is segmented into areas where the participant is expected to be inattentive (during the beach pictures) and where the participant is expected to be attentive (during the short training video). The goal is to collect exemplary data of attentive and inattentive epochs gathered from participants observing videos that are purposely designed to induce an inattentive or attentive state. Once the system is trained, it can then be used as an automated detection mechanism for attentiveness, because it has learned by example what the EEG data contains in these situations. Even if the participants are not perfectly attentive for the entire duration of the training video, they are assumed to be attentive for the majority of the time, so that outlier data points will be in the minority.
For the Pilot study, each participant has 150 seconds of inattentive EEG data (watching the beach photos), and 150 seconds of attentive EEG data (watching the training video), for a total of 300 seconds (5 minutes) of EEG data.Since data collection is in 1-second epochs, a total of 2100 sample epochs are collected across 7 participants.
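The segmentation and epoch count described above can be sketched as follows. This is only an illustrative sketch: the function `label_epochs`, the label strings, and the zero-valued placeholder recordings are assumptions, and it presumes each recording starts exactly at the beginning of the beach segment.

```python
SAMPLE_RATE_HZ = 512        # samples per one-second epoch
INATTENTIVE_SECONDS = 150   # beach photos
ATTENTIVE_SECONDS = 150     # short training video


def label_epochs(recording):
    """Split one participant's recording into (epoch, label) pairs."""
    total_seconds = INATTENTIVE_SECONDS + ATTENTIVE_SECONDS
    if len(recording) != total_seconds * SAMPLE_RATE_HZ:
        raise ValueError("recording must contain exactly 300 seconds of samples")
    labeled = []
    for second in range(total_seconds):
        start = second * SAMPLE_RATE_HZ
        epoch = recording[start:start + SAMPLE_RATE_HZ]
        # The first 150 epochs come from the beach photos, the rest from the video.
        label = "inattentive" if second < INATTENTIVE_SECONDS else "attentive"
        labeled.append((epoch, label))
    return labeled


# Placeholder zero-valued recordings standing in for the 7 participants.
participants = [[0.0] * (300 * SAMPLE_RATE_HZ) for _ in range(7)]
dataset = [pair for recording in participants for pair in label_epochs(recording)]
print(len(dataset))  # 7 participants x 300 epochs = 2100
```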

III. ANALYSIS OF EEG DATA
Analysis consists of windowing, transformation, data reduction, and machine learning and pattern recognition.

A. Windowing
Non-overlapping 512-sample, one-second windows are used. While many possible windowing functions are available, this Pilot study uses the Hanning window shown in Equation (2), which reduces ripple in the frequency power spectrum.
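A minimal sketch of the windowing step is given below. It assumes the common symmetric form of the Hanning window, w[n] = 0.5(1 - cos(2πn/(N - 1))), which may differ in detail from the exact form used in the Pilot study; the function names and the synthetic 10 Hz test signal are illustrative only.

```python
import math


def hann_window(n_samples):
    """Symmetric Hanning window: w[n] = 0.5 * (1 - cos(2*pi*n / (N - 1)))."""
    if n_samples < 2:
        return [1.0] * n_samples
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / (n_samples - 1)))
            for n in range(n_samples)]


def apply_window(epoch):
    """Multiply one epoch sample-by-sample with the Hanning window."""
    window = hann_window(len(epoch))
    return [x * w for x, w in zip(epoch, window)]


# Synthetic one-second epoch: a 10 Hz sine sampled at 512 Hz (not EEG data).
epoch = [math.sin(2.0 * math.pi * 10.0 * n / 512) for n in range(512)]
windowed = apply_window(epoch)
```

The window tapers both ends of the epoch towards zero, which is what suppresses the spectral ripple caused by cutting the signal abruptly at the window edges.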

B. Transformation
After windowing is complete, the data must be transformed. This is done to extract important features that will be used by machine learning, and also in large part to remove instance-by-instance timing variations that are not important to the problem at hand.

 
Note that the sequence of values |Xk| is symmetric about N/2, as shown in Equation (6).

 
Also used for comparison in the Pilot study is the DWT, as shown in Equations (7) and (8), where the original windowed signal x'n is transformed into two signals, each of half the duration (downsampled by a factor of two), called Xappn and Xdetn, the approximation coefficients and the detail coefficients, respectively. The signals hn and gn are a specially matched pair of filters, low-pass and high-pass respectively (an arrangement called Mallat wavelet decomposition), which are generated from the desired wavelet function.
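To make the filter-and-downsample step concrete, the sketch below performs one level of decomposition with the Haar (db1) filters, for which the low-pass and high-pass filters reduce to a scaled sum and a scaled difference of adjacent samples. The function name `haar_dwt_level`, the sign convention of the detail filter, and the sample values are assumptions for illustration.

```python
import math


def haar_dwt_level(signal):
    """One level of Haar DWT: returns (approximation, detail), each half as long."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    scale = 1.0 / math.sqrt(2.0)
    # Low-pass filter followed by downsampling by two: scaled pairwise sums.
    approx = [(signal[i] + signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    # High-pass filter followed by downsampling by two: scaled pairwise differences.
    detail = [(signal[i] - signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    return approx, detail


# Arbitrary illustrative samples, not EEG data.
samples = [4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 0.0, 2.0]
approx, detail = haar_dwt_level(samples)
print(len(approx), len(detail))
```

With this orthonormal scaling, the total energy of the approximation and detail coefficients equals the energy of the input samples.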

   
DWT can sometimes be a preferred transformation over DFT, because it preserves the sequencing of time-varying events, and has greater resolution with regard to the timing of signal events (rather than only phase information). DFT can be made to approximate DWT benefits by using a smaller window size, with a tradeoff in low-frequency resolution, and variations on such a Short Time Fourier Transform (STFT), including window overlap and window averaging, appear in the literature. Similarly, a limitation of DWT is that it provides scale information, but not specifically frequency information. Here too, techniques described in the literature show a means to extract frequency information from DWT based on the downsampling nature of the algorithm. For example, the selection of a window size that is a power of 2 (such as the 512 samples used in the Pilot study) allows repeated iterations of the wavelet transform to focus on different frequency bands.
DWT decompositions can be repeated again and again, saving the detail coefficients, and then taking the approximation coefficients and passing them through Equations (7) and (8) again. Each such repetition is called a Level of DWT, so the first level is on the original signal using a window size of N. The second iteration is Level 2, using a window size of N/2. The l-th iteration is Level l, using a window size of N/2^(l-1). Equations (9) and (10) describe how the detail coefficients Xdetn of the Level l DWT decomposition represent a subset of the frequencies contained within the original signal. FNyquist is the Nyquist frequency of the sampled signal, and the frequencies in the detail coefficients range from a low of Fdet_L to a high of Fdet_H.

   
In the Pilot study, windows of 512 samples are used, representing one second of collected data (512 Hz sampling rate). The Nyquist frequency, FNyquist, is therefore half the sampling rate (256 Hz) and is the highest frequency represented in the sample. Any frequency higher than 256 Hz that appears in the signal prior to sampling (analog-to-digital conversion) will result in aliasing, and so must be removed prior to sampling (using hardware filters in the EEG data collection device). In the Pilot study, a six-level DWT decomposition was used, employing the Daubechies db1 wavelet (equivalently the Haar wavelet, which is a step function with values 1 and -1 of equal duration). The equivalent frequency range for each of the six levels follows from Equations (9) and (10). The Pilot study compares both DFT and DWT algorithms.
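The frequency range covered by each level can be tabulated with a short sketch. It assumes the usual dyadic relationship in which the Level l detail coefficients span FNyquist/2^l to FNyquist/2^(l-1); the function name `dwt_detail_band` is hypothetical.

```python
def dwt_detail_band(level, nyquist_hz):
    """Assumed dyadic band (low, high) in Hz of the detail coefficients at a level."""
    if level < 1:
        raise ValueError("level must be 1 or greater")
    low = nyquist_hz / (2 ** level)
    high = nyquist_hz / (2 ** (level - 1))
    return low, high


SAMPLE_RATE_HZ = 512
NYQUIST_HZ = SAMPLE_RATE_HZ / 2  # 256 Hz

bands = {level: dwt_detail_band(level, NYQUIST_HZ) for level in range(1, 7)}
for level, (low, high) in bands.items():
    print(f"Level {level} detail: {low:g} to {high:g} Hz")
# The approximation coefficients at the last level cover what remains below.
print(f"Level 6 approximation: 0 to {bands[6][0]:g} Hz")
```

Under this assumption the six detail bands halve from 128 to 256 Hz at Level 1 down to 4 to 8 Hz at Level 6, with the Level 6 approximation coefficients covering 0 to 4 Hz.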

C. Data Reduction
Data reduction is needed to speed up the machine learning algorithms, and to mitigate the tendency of high-dimensional data to cause machine learning algorithms to over-fit the training data. Over-fitting reduces their ability to generalize (to correctly categorize data that is not part of the training set). This is sometimes called the curse of dimensionality.
Data reduction can take place using transformation-independent algorithms such as Principal Component Analysis or Fisher Discriminant Analysis. These algorithms seek to find an optimal linear transformation that reduces the number of dimensions while keeping the data as spread out as possible, thereby keeping the dimensional information most useful to distance-metric machine learning algorithms. Data reduction techniques can also be specific to the transformation algorithm, using knowledge of the algorithm to find an optimal data reduction scheme.
Data reduction of DFT is described in the literature as discarding frequency information, or as banding adjacent frequency power values by averaging them into a single band-power value. The Pilot study compares different data reduction algorithms. For the DFT, frequency data is discarded at the high end in increments of powers of 2, to compare machine learning accuracy with reduced data sets. The Pilot study therefore presents the machine learning algorithm with data sets having dimensionality 256, 128, 64, 32, 16, 8, 4, and 2.
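Both DFT reduction strategies can be sketched in a few lines. The function names `keep_lowest_bins` and `band_average` are hypothetical, the spectrum is a placeholder, and the sketch assumes the 256 power values are ordered from lowest to highest frequency.

```python
def keep_lowest_bins(power_values, n_keep):
    """Discard high-frequency values, keeping the n_keep lowest-frequency ones."""
    if n_keep < 1 or n_keep > len(power_values):
        raise ValueError("n_keep must be between 1 and the number of values")
    return power_values[:n_keep]


def band_average(power_values, band_width):
    """Average each run of band_width adjacent values into one band-power value."""
    bands = []
    for start in range(0, len(power_values), band_width):
        band = power_values[start:start + band_width]
        bands.append(sum(band) / len(band))
    return bands


# Placeholder spectrum of 256 power values (not measured data).
spectrum = [float(k) for k in range(256)]

# Dimensionalities compared in the Pilot study: 256, 128, ..., 2.
dimensionalities = [256 >> i for i in range(8)]
reduced_sets = {d: keep_lowest_bins(spectrum, d) for d in dimensionalities}
print(dimensionalities)
```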
Data reduction of DWT is described in the literature as grouping the various detail coefficients and the approximation coefficients into one or more descriptive attributes of that set of data. For example, the literature describes taking a set of detail coefficients and combining them into a set of parameters including Energy, Power, Median, Entropy, Mean, Min, Max, and Slope. For the Pilot study, the DWT is examined at each of 6 levels of decomposition, where at each level the detail coefficients are used to calculate the aforementioned eight parameters, and at the 6th level of decomposition, the approximation coefficients are also used to calculate these parameters. The Pilot study therefore uses 7 sets of coefficients, each yielding 8 parameters, giving the data sets a dimensionality of 8 x 7 = 56.
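The 56-dimensional DWT feature vector can be sketched as below. The precise definitions of the eight parameters are not given here, so the ones in `describe` are assumptions: energy as the sum of squares, power as the mean square, entropy as the Shannon entropy of the normalized squared coefficients, and slope as the least-squares slope against coefficient index. The function names are hypothetical.

```python
import math
import statistics


def haar_step(signal):
    """One Haar DWT level: (approximation, detail), each half the input length."""
    scale = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    return approx, detail


def describe(coeffs):
    """Eight summary parameters for one coefficient set (definitions assumed)."""
    n = len(coeffs)
    energy = sum(c * c for c in coeffs)
    power = energy / n
    entropy = 0.0
    if energy > 0.0:
        # Shannon entropy of the normalized squared coefficients.
        probs = [p for p in (c * c / energy for c in coeffs) if p > 0.0]
        entropy = -sum(p * math.log2(p) for p in probs)
    mean = sum(coeffs) / n
    # Least-squares slope of coefficient value against coefficient index.
    idx_mean = (n - 1) / 2.0
    denom = sum((i - idx_mean) ** 2 for i in range(n))
    numer = sum((i - idx_mean) * (c - mean) for i, c in enumerate(coeffs))
    slope = numer / denom if denom else 0.0
    return [energy, power, statistics.median(coeffs), entropy,
            mean, min(coeffs), max(coeffs), slope]


def dwt_features(window, levels=6):
    """Parameters of the detail set at every level plus the final approximation."""
    features = []
    approx = list(window)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        features.extend(describe(detail))
    features.extend(describe(approx))
    return features


# Synthetic 512-sample window (a 6 Hz sine), not EEG data.
window = [math.sin(2.0 * math.pi * 6.0 * n / 512) for n in range(512)]
features = dwt_features(window)
print(len(features))  # 7 coefficient sets x 8 parameters = 56
```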

D. Machine Learning and Pattern Recognition
The purpose of N-fold cross validation is to put the problem solution through rigorous quality assurance using the data that is available. Nested N-fold cross validation adds the additional step of changing the EEG signal analysis parameters. For example, if 10 different ways of performing EEG signal analysis are to be compared using 10-fold cross validation, then 100 different tests will be performed. Note that in the Pilot study, 8 parameters are calculated for each DWT coefficient set. As in Figure 4, the same set of machine learning algorithms is compared.
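The structure of such a comparison can be sketched with an outer loop over signal analysis options and an inner loop over folds. Everything in this sketch is an assumption made for illustration: the data are synthetic rather than EEG, the only analysis option varied is the number of leading features kept, and a simple 1-nearest-neighbor classifier stands in for the machine learning algorithms.

```python
import random


def k_fold_indices(n_items, n_folds, seed=0):
    """Shuffle item indices and deal them into n_folds roughly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[f::n_folds] for f in range(n_folds)]


def nearest_neighbor_label(train, query):
    """1-NN: label of the training point closest to query (squared Euclidean)."""
    best = min(train, key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)))
    return best[1]


def cross_validated_error(samples, n_folds=10):
    """Fraction of samples misclassified when each fold is held out in turn."""
    errors = 0
    for held_out in k_fold_indices(len(samples), n_folds):
        held = set(held_out)
        train = [samples[i] for i in range(len(samples)) if i not in held]
        errors += sum(nearest_neighbor_label(train, samples[i][0]) != samples[i][1]
                      for i in held_out)
    return errors / len(samples)


def nested_comparison(samples, dimensionalities, n_folds=10):
    """Outer loop over analysis options; inner loop over cross-validation folds."""
    results = {}
    for dims in dimensionalities:
        reduced = [(feats[:dims], label) for feats, label in samples]
        results[dims] = cross_validated_error(reduced, n_folds)
    return results


# Synthetic two-class data: only feature 0 separates the classes.
rng = random.Random(1)
samples = []
for i in range(200):
    label = i % 2
    feats = [rng.gauss(3.0 * label, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(7)]
    samples.append((feats, label))

results = nested_comparison(samples, [1, 2, 4, 8])
for dims, error in results.items():
    print(f"{dims} features: {100.0 * error:.1f}% incorrect")
```

With 4 analysis options and 10 folds this sketch runs 40 train-and-test cycles; the 10 options in the example above would give 100.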

IV. CONCLUSIONS AND FUTURE WORK
The results and review encourage the author to investigate this area further. The conclusion reached from the pilot study is that some further improvements are needed, which should not be very difficult to accomplish, in order to solve the problem of EEG signal processing for the detection of attentiveness while watching short training videos.

A. Experimental Setup
A number of improvements to the experimental setup will be needed. These improvements were discovered while conducting the trial run of the experiments, and by having the educational podcasts (short training videos) reviewed by an expert in the field. The improvements planned include:
• The beach scene may be changed to induce further inattentiveness (more boring), for example a single burning candle or another simpler video without audio.
• Improve the training video based on feedback from the professional review. Have the videos instruct the participant to complete simple tasks that allow external measurement of whether the participant is on task.
• Ask participants to go to the bathroom first.
• Insist that cellphones be turned off.
• Find a quieter room for the experiment; occasional voices through the walls and slamming doors outside were distracting.
• Rescale the "age" question; there was a negative reaction to having to check "45+".
The post-experiment questions will be improved to collect more data. Open-ended questions will be added. This will allow the proposed research, which is basic research intended to produce an automated system, to also collect data for applied research, such as research that employs the final method and apparatus under the direction of an educational psychologist. Open-ended questions may include formulations such as "tell me what you were thinking when…?" and "what were you feeling when…?"
Data collection improvements are also needed, with some ideas including:
• Use a fresh AAA battery before every experiment or two, as the wireless EEG headset loses connection after 2 or 3 uses.
• Have a more precise time stamping method for synchronizing video playback with the start of EEG collection (because the PC is sometimes slow to start the video playback). One option is a stopwatch with a lap timer, separate from the experiment laptop, so that it is unaffected by CPU load that might delay recording the moment the operator clicks a button.
• Even though outside distractions will be minimized as described above, there should still be a time stamping method for recording precisely when observed unplanned events take place (a cough, the time when a participant sighs or lifts their hand, etc.).

B. Signal Processing
Some improvements needed for the proposed research were discovered during the Pilot study. These include the application of Principal Component Analysis (PCA) and other methods of data reduction to ensure the maximum amount of useful information is obtained. Also, it may be beneficial to eliminate noise such as muscle and eye movement prior to feature extraction. Based on test results, the most suitable methods will be selected in terms of both performance and speed.
Finally, the Pilot study confirms the need for a structured and orderly comparison using a nested n-fold cross validation in which all such options are tested, compared and optimized.

C. Other Future Work
A few of the references in this paper are older. These are wonderful foundational documents; however, an additional reference search will look for advances which can be helpful to the research. Also noted is the need for additional detail on the results obtained in comparison. This reaffirms the need for structured nested n-fold cross validation, which not only provides an interpretation of the results, but also definitively reveals the optimal solution for the experimental problem through direct side-by-side comparison.

The Pilot study compares Discrete Fourier Transform (DFT) with Discrete Wavelet Transform (DWT). Both DFT and DWT provide banded spectrum information. Equation (3) shows the transformation of the sampled data xn into the frequency domain Xk using DFT, and Equation (4) shows the magnitude of the Fourier transform |Xk|, used to remove phase information, leaving only the frequency information. Each element |Xk| now represents the power of that frequency within the original signal, with fk ranging from zero Hz (sometimes called DC for electrical signals) all the way up to the Nyquist frequency of the sampled data window of width T seconds, as seen in Equation (5).
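The DFT magnitude and its frequency axis can be illustrated with a direct evaluation of the standard DFT definition. This is a minimal sketch assuming the conventional form X_k = Σ x_n e^(-2πikn/N) and f_k = k/T; the function names are hypothetical, and a 64-sample synthetic signal is used instead of the 512-sample windows of the Pilot study to keep the example small.

```python
import cmath
import math


def dft_magnitudes(samples):
    """|X_k| for k = 0..N-1, evaluated directly from the DFT definition."""
    n_total = len(samples)
    magnitudes = []
    for k in range(n_total):
        x_k = sum(x * cmath.exp(-2j * math.pi * k * n / n_total)
                  for n, x in enumerate(samples))
        # Taking the magnitude discards the phase information.
        magnitudes.append(abs(x_k))
    return magnitudes


def bin_frequencies(n_total, window_seconds):
    """f_k = k / T for k = 0..N/2, from zero Hz up to the Nyquist frequency."""
    return [k / window_seconds for k in range(n_total // 2 + 1)]


N = 64   # samples in the window (illustrative)
T = 1.0  # window width in seconds
signal = [math.sin(2.0 * math.pi * 5.0 * n / N) for n in range(N)]  # 5 Hz sine

mags = dft_magnitudes(signal)
freqs = bin_frequencies(N, T)
peak_bin = max(range(N // 2 + 1), key=lambda k: mags[k])
print(freqs[peak_bin])  # frequency of the strongest component
```

Because the input is real-valued, the magnitudes satisfy |X_k| = |X_(N-k)|, so only the values up to N/2 carry distinct information.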

Fig. 4. Nested 10-fold cross validation comparison of machine learning methods showing percentage of incorrect classifications with different DFT dimensionality presentation data.

Fig. 5. Nested 10-fold cross validation comparison of machine learning methods showing percentage of incorrect classifications with different DWT dimensionality presentation data.

Figure 4 shows such a nested 10-fold cross validation comparison, giving the percentage of incorrect classifications for DFT across a number of dimension reduction options and a number of machine learning algorithms. Shown in the figure is the DFT with the full 256 spectral power values, as well as a number of reduced dimensions achieved by keeping only the lowest frequency values (keeping only the lowest 2, 4, 8, 16, 32, 64, and 128 frequency power values). Furthermore, a number of machine learning algorithms, some from the literature and others used for comparison, are compared. They are, in order of listing in the graph key, the kNN algorithm with k = 1 and k = 5, the Random Tree algorithm, the Random Forest algorithm, a Multi-Layer Perceptron with the number of hidden layer neurons = (# attributes + 2)/2, and a Support Vector Machine using a hinge loss function.