Using Artificial Intelligence Approaches to Categorise Lecture Notes

Lecture materials cover a broad variety of documents, including e-books, lecture notes, handouts, research papers and lab reports. Downloaded from the Internet, these documents generally go into the Downloads folder or other folders specified by the students. Over time, the folders become so cluttered that it is difficult to find anything in them. Files are sometimes saved without any certainty that they will be used or referred to in the future. Documents end up scattered all over the computer system, making it troublesome and time-consuming for the user to search for a particular file. Improper naming conventions add to the difficulty: certain files bear names that are totally irrelevant to their contents, so the user has to open these documents one by one to find out what they are about. One solution to this problem is a file classifier. In this paper, a file classifier is used to organise lecture materials into eight different categories, easing the task of organising the files and folders on students' workstations. Modules each containing about 25 files were used in this study. Two machine learning techniques were used, namely decision trees and support vector machines (SVM). For most categories, it was found that decision trees outperformed SVM.
Keywords: Classification; lecture materials; machine learning; support vector machines; decision trees


I. INTRODUCTION
The rapid advancements in IT have brought about an exponential increase in the number of electronic documents. Documents that were presented on paper in the past are today created, stored, distributed and displayed digitally [1]. This trend has reached a wide variety of fields, if not all of them. The education field has not been left behind in the process. It has evolved alongside the advent of new technologies.
Students nowadays have thousands of files on their workstations, scattered in different folders, on different drives, etc. Some files have meaningful names while others do not. The easy access to information has also led to an increase in the amount of irrelevant information. Information from web pages, news articles, presentations and papers is saved on the machines without any certainty that it will be of some use in the future. This usually costs users a great deal of time when looking for a particular file, especially if the files are scattered in different places on the computer system and the file in question is not properly named. Therefore, an automatic file classification system is of utmost importance. The role of the file classifier would be to go through all the files in a given folder and determine the best fitting category for each file.
This paper proceeds as follows. Section II gives a description of the different techniques that are used for the classification process. Section III describes the methodology used and the tasks that need to be carried out to classify the documents. Section IV outlines the implementation process and critically analyses and evaluates the results of the classifiers. Finally, we conclude the study in Section V.

A. Text Mining
Text mining, also known as text analytics, is an umbrella term used to describe the wide range of technologies in place to analyze and process unstructured and semi-structured textual data [2], [3]. These technologies are used to extract meaningful information from documents or files that would then serve particular purposes. The most common theme behind all the technologies is to turn textual information into numbers. Algorithms are then applied to the numerical representation of the words, documents and eventually to full databases. The data is then handled and processed as per one's requirements.
Text mining involves the application of techniques from fields such as information retrieval, information extraction, natural language processing, machine learning, classification, clustering and text categorisation. Information retrieval is an area pertaining to the organisation, examination, storage and retrieval of information from different sources. It performs several tasks such as document ranking and document classification. This paper discusses two main classification techniques, namely decision trees and support vector machines.

B. Decision Trees
Decision trees are a very simple but powerful classification method. One advantage of a decision tree is that it can be very easily interpreted by humans. It is commonly used in pattern recognition problems for knowledge systems [4]. A decision tree is very similar to a flow diagram. It consists of internal nodes, branches and leaf nodes. Each internal node designates a test on a particular attribute, the branches denote the outcomes of that test and the leaf nodes indicate the class distribution [5]. The topmost node is known as the root node and it is denoted by an oval. Rectangles are used to symbolise the internal nodes. The leaf nodes, on the other hand, are circular in shape.
To create a decision tree, a list of attributes is first drawn up for measurement. A target attribute is then chosen for prediction. All data is processed to count the number of times an attribute appears in each document. Decision trees use the concept of entropy for splitting on attributes, which reduces the number of attributes that need to be considered. Splitting on the attributes results in a hierarchy of branches. These branches and nodes together make up the decision tree. Any node can give rise to a further branch of nodes. Each branch in the tree produces an observation. This observation is made using the state of one of the fields in the dataset. A complementary technique, used to control the size of the tree, is called pruning. There are two well-known types of pruning, namely pre-pruning and post-pruning, also known as forward pruning and back-pruning respectively. In pre-pruning, the user decides when to stop adding attributes during the building process. As a result, it can lead to very biased decisions, as individual attributes do not contribute much to the decision. Post-pruning is different in that the decision tree is fully built prior to pruning the elements [6].
Decision trees are efficient at classifying new and unseen instances. However, building a decision tree can be very time-consuming. One serious weakness of decision trees is the problem of error propagation throughout a tree. Decision trees are built by a series of local decisions. These local decisions have a carryover effect. Therefore, if one of the local decisions goes wrong at some point, all successive decisions are likely to be bad as well. In such a case, the correct path of the tree might not be returned [6].
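The entropy-based splitting described above can be sketched as follows. This is a minimal illustration, assuming plain information gain on a single word-frequency attribute with a fixed threshold; the function names, the toy data and the threshold are hypothetical, and real implementations may use normalised variants of this measure.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(labels, feature_values, threshold):
    """Entropy reduction obtained by splitting on feature_value <= threshold."""
    left = [y for y, x in zip(labels, feature_values) if x <= threshold]
    right = [y for y, x in zip(labels, feature_values) if x > threshold]
    total = len(labels)
    # Weighted entropy of the two partitions (empty partitions are skipped).
    remainder = sum(len(part) / total * entropy(part)
                    for part in (left, right) if part)
    return entropy(labels) - remainder

# Hypothetical toy data: frequency of one word in six documents.
labels = ["Multimedia"] * 3 + ["Others"] * 3
word_freq = [5, 7, 4, 0, 1, 0]
gain = information_gain(labels, word_freq, threshold=2)
```

In this toy case the split separates the two classes perfectly, so the gain equals the full entropy of the label set (one bit).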

C. Support Vector Machines
Support vector machines (SVMs) are a learning method introduced by Vladimir Vapnik and colleagues. They are used for pattern recognition, classification and regression. Support vector machines have been very successful in various learning areas [7], [8]. SVMs construct hyperplanes for linearly separable patterns. The basic idea in SVM is to find a separator (a hyperplane) which divides multi-dimensional data into two classes [9]. SVMs work towards maximising predictive accuracy while avoiding over-fitting. SVMs give very good results for applications involved in classifying text, recognizing hand-written characters, classifying images and also in bio-informatics. One of the strongest points of SVMs is that they impose no limit on the number of attributes that can be used. One drawback, however, is that SVMs require a lot of memory [10].
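For two linearly separable classes, the separating-hyperplane idea is commonly written as the standard hard-margin optimisation problem below. The notation is introduced here for illustration and is not taken from the paper: $\mathbf{x}_i$ is the feature vector of document $i$, $y_i \in \{-1, +1\}$ its class, and $\mathbf{w}$, $b$ define the hyperplane $\mathbf{w}^{\top}\mathbf{x} + b = 0$.

```latex
\min_{\mathbf{w},\, b} \; \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right) \ge 1,
\qquad i = 1, \dots, n
```

Minimising $\lVert \mathbf{w} \rVert$ maximises the margin $2 / \lVert \mathbf{w} \rVert$ between the two classes. Practical implementations typically solve a soft-margin variant that tolerates some misclassified points.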

III. METHODOLOGY
The very first step in the classification of the lecture materials is to build a dataset. A dataset in this study is simply a collection of relevant documents. Eight categories of lecture materials amounting to 213 files were selected and put in a common folder. Table I shows the categories and the number of files used in each category. NLTK (Natural Language Toolkit) has been used to process the files. It is the most commonly used platform for writing Python programs that interact with textual data [11]. It is open source software and is made up of a plethora of libraries that allow for the manipulation of high-level data. Firstly, the documents are converted to lowercase to avoid ambiguities at later stages. Secondly, the files are cleaned: all the punctuation marks, digits and special characters are removed. The series of words is then subjected to tokenization, which breaks the documents into distinct words or tokens. Each word is then checked against NLTK's stopword list. The stopword list is a corpus covering 11 languages with a total of 2,400 stop words [12]. Stop words are words like "the", "is" and "a" that do not carry much weight when it comes to determining the best category of a file. Thus, all stop words are eliminated from the documents, leaving us with only potentially useful and meaningful words.
The last step in the cleaning process is the application of stemming to the words, as shown in Fig. 1. Stemming is a method for removing the affixes from a word in order to end up only with the stem, which is also known as the root. It is a common technique used in search engines for indexing words. The search engine stores only the stems, instead of keeping all the different forms of a word. This is very helpful as it reduces the size of the index by a considerable amount, thus improving performance and retrieval accuracy. One of the most popular stemming algorithms is the Porter Stemmer Algorithm. It removes and replaces well-known suffixes of English words [13]. NLTK supports a number of other stemming algorithms as well, namely the Lancaster stemmer, the Regexp stemmer and the Snowball stemmer [14]. For this project, the Snowball stemmer has been used.
Once the documents are cleansed, the array of meaningful and stemmed words is further processed to get the frequency of each word in each document. The outputs are stored in CSV files. The resulting CSV files are fed into WEKA [15]. The following section gives more details about the classification process in WEKA and evaluates the classifier outputs.
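The cleaning steps above (lowercasing, removal of punctuation and digits, tokenization, stop-word removal) can be sketched with the Python standard library alone. This is an illustrative approximation, not the study's code: the `STOP_WORDS` set is a tiny hypothetical stand-in for NLTK's stopword list, and whitespace splitting stands in for NLTK's tokenizer.

```python
import re

# Tiny illustrative stop list (an assumption); the study uses NLTK's
# much larger stopword list instead.
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "to", "in"}

def clean_and_tokenize(text):
    """Lowercase, strip non-letters, split into tokens, drop stop words."""
    lowered = text.lower()
    # Replace everything except ASCII letters and whitespace with a space.
    # Note: this also discards accented and other non-ASCII letters.
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)
    tokens = letters_only.split()
    return [t for t in tokens if t not in STOP_WORDS]

tokens = clean_and_tokenize("The SVM is a classifier: 2 classes, 1 hyperplane!")
# tokens == ["svm", "classifier", "classes", "hyperplane"]
```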
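As a brief illustration of the stemming step, the snippet below applies NLTK's Snowball stemmer to a few words. It assumes NLTK is installed and that English is the target language; the word list is hypothetical.

```python
from nltk.stem.snowball import SnowballStemmer

# English Snowball stemmer; no corpus download is needed for this usage.
stemmer = SnowballStemmer("english")

words = ["running", "documents", "lectures"]
stems = [stemmer.stem(w) for w in words]
# "running" is reduced to "run" and "documents" to "document".
```

Note that a stem need not be a dictionary word; it only has to be shared by the related forms of a word so that their frequencies are counted together.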
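The frequency-counting and CSV export step might look like the sketch below. The column layout (one row per document, one column per stemmed word), the file names and the token lists are assumptions made for illustration; the paper does not specify the exact CSV structure.

```python
import csv
import io
from collections import Counter

def frequency_rows(docs):
    """Build a header and one row of word counts per document."""
    counts = {name: Counter(tokens) for name, tokens in docs.items()}
    # Shared vocabulary across all documents, sorted for a stable column order.
    vocabulary = sorted(set().union(*counts.values()))
    header = ["document"] + vocabulary
    rows = [[name] + [counts[name][w] for w in vocabulary] for name in docs]
    return header, rows

# Hypothetical stemmed tokens for two documents.
docs = {
    "notes1.pdf": ["imag", "video", "imag", "layer"],
    "notes2.pdf": ["tabl", "queri", "tabl"],
}
header, rows = frequency_rows(docs)

buffer = io.StringIO()  # stand-in for an open CSV file
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Words absent from a document get a count of zero, so every row has the same number of columns.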

IV. IMPLEMENTATION AND EVALUATION
WEKA supports a particular file format known as the ARFF data format. ARFF stands for Attribute-Relation File Format. It is an ASCII file describing a set of samples having a number of attributes in common. The ARFF-Viewer tool in WEKA allows for the conversion of CSV data files to the ARFF data format. An ARFF data file has a very particular format. It has two distinct sections, the header part followed by the data information. It starts with @RELATION, which gives the name of the relation (the dataset), followed by @ATTRIBUTE, giving a list of the file's attributes, and lastly @DATA.
All the word attributes in our ARFF files are of type "numeric" since we are dealing with the frequencies of the words in the documents. The data is represented as a stream of numbers. Viewed in WEKA's ARFF-Viewer, we are presented with a tabular form of the file (Fig. 2), which is easier to interpret.
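To make the ARFF layout concrete, the sketch below renders a small word-frequency table as ARFF text. It is a hypothetical helper, not the ARFF-Viewer conversion used in the study. The nominal `class` attribute is an assumption added so that the example carries a category label, and attribute names are assumed to contain no spaces or special characters (which would require quoting).

```python
def to_arff(relation, attributes, rows, classes):
    """Render numeric word-frequency rows plus a nominal class as ARFF text."""
    lines = [f"@RELATION {relation}", ""]
    # One numeric attribute per word.
    lines += [f"@ATTRIBUTE {name} NUMERIC" for name in attributes]
    # Assumed nominal class attribute listing the allowed category labels.
    lines.append("@ATTRIBUTE class {" + ",".join(classes) + "}")
    lines += ["", "@DATA"]
    for values, label in rows:
        lines.append(",".join(str(v) for v in values) + "," + label)
    return "\n".join(lines) + "\n"

# Hypothetical two-document example.
arff = to_arff(
    relation="multimedia",
    attributes=["imag", "layer"],
    rows=[([2, 1], "Multimedia"), ([0, 0], "Others")],
    classes=["Multimedia", "Others"],
)
```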
The datasets for all eight categories of lecture materials were classified using two different machine learning techniques and the outputs were compared. From existing works, we have noticed that it is common practice to test the algorithms with a balanced number of positive and negative samples. Thus, we have used an equal number of documents to carry out the experiments. A binary approach was followed, i.e. for each category we took 15 positive samples and 15 negative samples (the latter termed the "Others" category).
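A balanced binary dataset of this kind could be assembled as sketched below. How the negative samples were actually selected is not stated, so drawing them at random from the remaining categories is an assumption, as are the function name, the seed and the file names.

```python
import random

def balanced_binary_set(files_by_category, target, n_per_class=15, seed=0):
    """Pick n positives from the target category and n negatives from the rest."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    positives = rng.sample(files_by_category[target], n_per_class)
    # Pool every file that does not belong to the target category.
    pool = [f for cat, files in files_by_category.items()
            if cat != target for f in files]
    negatives = rng.sample(pool, n_per_class)
    return ([(f, target) for f in positives]
            + [(f, "Others") for f in negatives])

# Hypothetical file names for three of the modules.
files = {
    "Multimedia": [f"mm_{i}.pdf" for i in range(25)],
    "Database": [f"db_{i}.pdf" for i in range(25)],
    "Security": [f"sec_{i}.pdf" for i in range(25)],
}
dataset = balanced_binary_set(files, "Multimedia")
```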

A. J48
The datasets were first classified using the J48 decision tree algorithm in WEKA. J48 normally selects a set of keywords from the dataset on which to base its decision [16]. However, the selection of these keywords is not stable, as a small change in the dataset may alter the results considerably. Also, the keyword chosen may not always reflect the intended category. An example is given in Fig. 3, which shows the classifier's tree visualizer for Multimedia. The word "layer" has been chosen to decide between the Multimedia and the Others categories. This word, however, is not appropriate as it may be used in many contexts other than Multimedia. Words like "multimedia", "image" and "video" would have been more appropriate in this case.

B. LibSVM
The datasets were subjected to a second round of classification, this time with LibSVM [17]. For the Multimedia category, for instance, all 15 documents pertaining to the category were correctly classified. Table II indicates that out of the 15 files that are actually from the Others category, seven were correctly classified while the remaining eight were not. They were classified as Multimedia files instead of Others. As for the Multimedia files, it was an error-free classification.
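The counts reported for Table II can be tallied into a confusion matrix as in the short sketch below. The label lists are reconstructed from the counts stated in the text (15 Multimedia files all correct; 7 Others files correct and 8 classified as Multimedia); the variable names are illustrative.

```python
from collections import Counter

# Labels reconstructed from the Table II counts described in the text.
actual = ["Multimedia"] * 15 + ["Others"] * 15
predicted = ["Multimedia"] * 15 + ["Multimedia"] * 8 + ["Others"] * 7

# Keys are (actual, predicted) pairs; values are document counts.
confusion = Counter(zip(actual, predicted))

correct = sum(n for (a, p), n in confusion.items() if a == p)
accuracy = correct / len(actual)  # 22 of 30 documents, roughly 73.3%
```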

C. Summary of Outputs
Table III shows a summary of the classifier outputs with J48 and LibSVM for all the 8 categories of lecture materials.
A pertinent observation is the meagre percentage of correctly classified instances for the Database category. Database is a very common field in computing. It merges with many other fields in a fluid manner and it may be applied in a variety of computing contexts. Therefore, files from Enterprise Resource Planning (ERP) and Management Information Systems (MIS) may well fall in the Database category. This is one potential reason for the drop in the percentage of correctly classified instances for this particular category. The overall accuracy for J48 was 83.3% while for SVM it was 76.7%. From these statistics and from Fig. 4, we can see that J48 has done slightly better in this scenario.

D. Accuracy of Outputs
The accuracy of the classifier outputs in WEKA is determined by several distinct parameters. These parameters are: True Positive Rate (TP Rate or Recall), False Positive Rate (FP Rate), Precision and the F-measure. Table IV shows the accuracy by category for both classifiers. A TP rate of one is an ideal result. It means that all of the documents were correctly classified. All Security files were correctly classified, hence yielding a recall of 100% with both classifiers. Fields like Software Engineering and Cyberlaws, which are quite distinct from the rest, have also fetched high values. The recall value for Multimedia is exceptionally low for the SVM classifier. However, the explanation for this can be seen in Table II: many files from the Others category were classified as being in the Multimedia category due to the presence of certain superfluous words. Nevertheless, the precision values are very high. A TP rate as low as 0.4 is an undesirable result, which is indicative of poor classification of the files. The TP rates for ERP and MIS are not very high either. These values support the observation that the modules ERP, MIS and Database share a lot of similar words, hence some files were incorrectly classified. In general, the values for precision and recall were appreciably high.
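The four parameters follow the usual definitions and can be computed from confusion-matrix counts as sketched below. The function name is hypothetical, and the counts are taken from the Table II description for LibSVM (Multimedia versus Others) purely as illustrative input. Since which class each row of Table IV treats as positive is not restated here, the sketch computes the figures from both points of view.

```python
def rates(tp, fn, fp, tn):
    """TP rate (recall), FP rate, precision and F-measure for one class."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    # F-measure is the harmonic mean of precision and recall.
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"tp_rate": recall, "fp_rate": fp_rate,
            "precision": precision, "f_measure": f_measure}

# Multimedia as the positive class: 15 found, 0 missed, 8 Others mislabelled.
multimedia = rates(tp=15, fn=0, fp=8, tn=7)
# Others as the positive class: 7 found, 8 missed, no Multimedia mislabelled.
others = rates(tp=7, fn=8, fp=0, tn=15)
```

With these counts, the Others class has a recall of 7/15 and a precision of 1.0, while the Multimedia class has a recall of 1.0 and a precision of 15/23.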

V. CONCLUSIONS
This paper discussed the classification of lecture materials. Two hundred and thirteen documents from eight different university modules were selected and classified into predefined sets. The documents were classified using two different machine learning techniques, namely decision trees and support vector machines. A number of experiments were carried out and the results of the classification were critically analysed. The outputs' parameters and various other factors showed that J48 was a better classification technique than SVM for this particular case. The overall accuracy for J48 was found to be 83.3% while for SVM it was only 76.7%. However, these results cannot be generalised as our dataset was quite small. In the future, we intend to repeat these experiments with many more files and more classifiers such as kNN, Naïve Bayes and artificial neural networks. Document size, i.e. the number of words in each file, will also be taken into consideration.

TABLE I
Fig. 1. Flowchart Outlining the Steps of the Implementation Process.

TABLE III. SUMMARY OF CLASSIFIER OUTPUTS

TABLE IV. ACCURACY BY CATEGORY