Personalized Subject Learning Based on Topic Detection and Canonical Correlation Analysis

Abstract: To keep pace with the times, learning from printed media alone is no longer a comprehensive approach. Fresh digital content can complement printed educational media. Although timely access to fresh content is becoming increasingly important for education, and gaining such access is no longer a problem, the capacity of human teachers to assimilate such huge amounts of content is limited. Topic Detection (TD) is a promising research area that addresses speedy access to desired content based on topic or subject. At the same time, personalized education is receiving more attention because it helps improve students' creativity and subject learning. This paper presents a patented Personalized Subject Learning (PSL) system that caters to the need for personalized education and efficiently provides subject-based content. An efficient topic detection algorithm for providing subject content is presented. Moreover, since educational content is multimedia-based and multimodal, PSL introduces the Canonical Correlation Analysis (CCA) method to detect multimodal correlations across different types of media. Owing to its novelty, PSL has been used as the key engine in a real-world personalized education system, the smart education module sponsored by a Smart City project.


I. INTRODUCTION
Throughout most of history, only the wealthy have been able to afford an education geared towards the individual learner. For the vast majority, education has remained a mass affair, with standard curricula, pedagogies, and assessments. It has long been believed that, so long as the system insists on teaching all students the same subjects on printed media in the same way, progress will be incremental. However, for the first time it is now possible to individualize education, that is, to teach each person what he or she needs and wants to know in ways that are most comfortable and most efficient, which may produce a qualitative spurt in educational effectiveness. How can we improve performance in education while cutting costs at the same time? In 1984, it was shown that individualized tutoring had a huge advantage over standard lecture environments: students who received individualized tutoring performed better than 98 per cent of students from the standard classes. Yet the question is how to make individualized or personalized education affordable. Daphne Koller from the Stanford AI Lab [1] argued that technology may provide a path to this goal.
Timely access to fresh content is becoming increasingly important in today's education, and gaining such access is no longer a problem because of the widespread availability of broadband in both homes and businesses. Ironically, high-speed connectivity and the explosion in the volume of digitized textual content have given rise to a new problem, namely information overload. Clearly, the capacity of human teachers to assimilate such vast amounts of content is limited. Topic Detection (TD) has emerged as a promising research area that harnesses the power of modern computing to address this problem by helping us obtain desired subjects or personalized topics automatically. A topic is defined as a seminal event or content, along with all directly related events and contents. Thus, a topic consists of events and contents, both of which are defined in greater detail in [2]. In Topic Detection and Tracking (TDT), an event is defined as something that happens at a specific time and place, along with all the necessary preconditions and unavoidable consequences. Such an event might be a new movie, an election, or an alien attack. TD enables the automatic discovery of new topics from a news corpus and the subsequent assignment of news documents to the discovered topics [3]. A new topic typically corresponds to a newsworthy incident such as the 2012 US presidential election. Therefore, TD technology is well suited to clustering fresh subject content.
Moreover, educational content is usually multimedia-based: it can be text, animation, sound, video, and so on. A text-based TD solution alone cannot perform the final content fusion needed for personalized content recommendation. In cross-media recommendation, the query examples and the recommended results need not be of the same media type. For example, students can receive sound pieces by submitting either an image example or a sound example. This is the so-called multimodal environment. Canonical Correlation Analysis (CCA) is therefore used to calculate correlations and measure multimodal similarities across media types [4].
Although existing TD solutions play important roles in their applications [3,5,6,7,8,9,10,11], they do not explicitly incorporate a language model and a cross-media CCA model into their formulations. Building on previous research [12,13,14,15,16,17,18], a novel Personalized Subject Learning (PSL) system is created based on the above ideas. The PSL system is a computer-aided education system using TD technologies and CCA methodology. Inspired by achievements in the Information Retrieval (IR) field, the Relevance Model (RM) is adopted as the language model for TD. RM is a theoretical extension of statistical language modelling and is applicable in both retrieval and TD [19]. By treating educational contents as news and stories, both TD and IR methods can be used to retrieve relevant contents and feed them into CCA to analyse cross-media correlations.
The remainder of this paper is organized as follows. In Section 2, key concepts and terms are defined and works directly related to the PSL system are reviewed. Section 3 describes a novel approach in terms of TD and CCA. In Section 4, the superiority of the PSL approach is demonstrated. Finally, in Section 5, conclusions and some future research directions are presented.

II. DIAGRAM AND EXAMPLE OF PERSONALIZED SUBJECT LEARNING
Personalized education refers to providing learning experiences tailored to each student's interests and learning styles. It also implies student-directed and self-managed learning. Teachers may individualize instruction in a classroom setting but admit that this is hard to accomplish given the competing need to cover subject matter. Well-programmed computers, whether in the form of personal computers or hand-held devices, are becoming an alternative choice. They will offer many ways to master materials. Students (or their teachers, parents, or coaches) will choose the optimal ways of presenting the materials. Appropriate tools for assessment will be implemented too. Most importantly, computers are infinitely patient and flexible. With a computer-aided personalized subject learning system, people can spend precious classroom time on more interactive problem-solving activities, which may help them achieve better understanding and foster creativity. Once personalized education takes hold, the world will be very different. Many more individuals will receive a better education because they will be learning in ways that suit them best.
A personalized subject learning system consists of 17 components, as shown in Figure 1. Precious antiques (textual and image contents) from the Old Summer Palace have been returned to China, as shown in the 1st and 2nd pictures in Figure 3. The film "12 Chinese Zodiac" directed by Jackie Chan (video content shown in the 3rd picture in Figure 3) has been correlated with them as teaching content by CCA. The system aims to provide personalized education materials based on subjects or topics. A traditional information retrieval (IR) system is not able to meet such a demand. Hence, this paper proposes a PSL system that combines an efficient topic detection method with canonical correlation analysis. The former shoulders the task of quickly clustering documents from vast and multiple textual content sources into subjects or topics. The latter is responsible for recommending relevant or correlated contents. The following section of the paper gives the formal representation of two key components, namely Component 7 for TD and Component 6 for CCA.

III. FORMAL REPRESENTATIONS
In this section, the formal representations of the PSL system, especially for TD and CCA, are described in order. Although there are many language tracking and modelling methods based on machine learning, thus far the Vector Space Model (VSM) [20] has achieved the best results [21]. VSM has been successfully applied in the well-known SMART text retrieval system [22]. There are a number of formal ways of describing relevance feedback, beginning with the notion of an "optimal query" used in the SMART system. The biggest advantage of VSM is that it simplifies the text into a vector representation of its features and weights.

A. Document Representation
Contents of the document are expressed by a number of feature items, which generally include basic linguistic units such as words or phrases. The inverse document frequency of a feature item $t_k$ is
$$idf_k = \log \frac{N}{n_k} \qquad (3)$$
Here $N$ represents the number of documents in the whole data set, and $n_k$ represents the number of documents in the data set in which $t_k$ appears.
It can be seen that the larger the value of $idf_k$, the fewer the documents that contain the given term. If all documents contain the same given item, $idf_k$ will be 0. In practice, to avoid such a case, equation (3) is improved by equation (4), which adds a constant $c$:
$$idf_k = \log \left( \frac{N}{n_k} + c \right) \qquad (4)$$
Generally, the constant value is between 0 and 1, and equation (5) is then induced as:
$$idf_k = \log \left( \frac{N}{n_k} + 0.01 \right) \qquad (5)$$
If the impact of document length on the weights is taken into account, the feature item weights are normalized into the range [0, 1].
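The weighting scheme described above can be illustrated with a minimal Python sketch. It assumes the weight of a term is its term frequency multiplied by the smoothed IDF of equation (5), and it assumes cosine (length) normalization as the way weights are brought into [0, 1]; the function name `tfidf_vectors` and the toy documents are hypothetical and are not taken from the PSL implementation.

```python
import math
from collections import Counter


def tfidf_vectors(docs, c=0.01):
    """Return one {term: weight} dict per tokenized document.

    Assumed form: w_ik = tf_ik * log(N / n_k + c), followed by
    length normalization so that every weight lies in [0, 1].
    """
    n_docs = len(docs)
    # n_k: number of documents in which term k appears at least once
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # tf_ik: occurrences of term k in document i
        raw = {t: f * math.log(n_docs / doc_freq[t] + c) for t, f in tf.items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        vectors.append({t: (v / norm if norm > 0 else 0.0) for t, v in raw.items()})
    return vectors


# Toy corpus (hypothetical): "palace" occurs in every document.
docs = [
    ["palace", "antique", "returned"],
    ["palace", "film", "zodiac"],
    ["palace", "election"],
]
vectors = tfidf_vectors(docs)
```

Because "palace" appears in every document, its smoothed IDF is log(1.01), which is small but non-zero, so it receives a much lower weight than terms that occur in a single document.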

B. TD Representation
The process of topic detection under this model is described here. In descending order of word frequency, the first i words are taken as feature items.

5) In the TD research field, the National Institute of Standards and Technology (NIST) and several universities, including Carnegie Mellon University (CMU), have established benchmarks and corpora for TDT. In this paper, the similarity between $T$ and $d$ is defined by adopting the principle reported by Lo and Gauvain of NIST [23]. $P(w \mid T)$ is the probability of $w$ in $T$:
$$P(w \mid T) = \frac{C(w, T)}{N_w(T)} \qquad (8)$$
$C(w, T)$ is the number of occurrences of $w$ in $T$, $N_w(T)$ is the total number of terms in $T$, and $P(w)$ is the a priori probability of $w$, which is a statistic estimated from the background corpus $B$:
$$P(w) = \frac{C(w, B)}{N_w(B)} \qquad (9)$$
Here $C(w, B)$ is the number of occurrences of $w$ in the background corpus, and $N_w(B)$ is the total number of terms in the background corpus.

6) According to the similarity measurement of NIST, topic detection is then described as the calculation of the similarity between the story and the topic. In other words, if $S(d, T) \geq \theta$ for a decision threshold $\theta$, the story and the topic are considered relevant (on-topic); otherwise the story is off-topic.
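As a rough illustration of the on-topic decision, the sketch below scores a story against a topic with a smoothed unigram model. It assumes the similarity takes the form of a length-normalized log-likelihood ratio, S(d, T) = (1 / L_d) * sum over w of tf(w, d) * log((lam * P(w|T) + (1 - lam) * P(w)) / P(w)), which is one form consistent with the symbols defined in this section; the exact formula, the smoothing value `lam`, the threshold `theta`, and all names and token lists are assumptions made for illustration only.

```python
import math
from collections import Counter


def similarity(story, topic, background, lam=0.5):
    """Length-normalized log-likelihood ratio of a story under a topic model.

    story, topic, background are token lists; lam is the smoothing factor
    in (0, 1). The formula is an assumed form, not a confirmed one.
    """
    if not story:
        return 0.0
    tf = Counter(story)
    topic_counts, bg_counts = Counter(topic), Counter(background)
    n_topic, n_bg = len(topic), len(background)
    score = 0.0
    for w, f in tf.items():
        p_bg = bg_counts[w] / n_bg          # P(w), estimated from background
        if p_bg == 0.0:
            continue                         # assumption: skip words unseen in background
        p_topic = topic_counts[w] / n_topic  # P(w|T) = C(w, T) / N_w(T)
        score += f * math.log((lam * p_topic + (1 - lam) * p_bg) / p_bg)
    return score / len(story)                # divide by L_d


def is_on_topic(story, topic, background, theta=0.0, lam=0.5):
    """Decide on-topic when the similarity reaches the threshold theta."""
    return similarity(story, topic, background, lam) >= theta


# Hypothetical toy data.
topic = ["palace", "antique", "returned", "palace", "zodiac"]
background = topic + ["election", "vote", "president", "film", "election",
                      "vote", "market", "market", "palace", "film"]
on_story = ["antique", "returned", "palace"]
off_story = ["election", "vote", "president"]
```

With this form, a word that is more probable under the topic than in the background contributes a positive term, while a word absent from the topic contributes log(1 - lam), so stories sharing no vocabulary with the topic receive a negative score.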

C. Model Design of TD
Kullback-Leibler divergence is used to compute Relative Entropy (RE) as the relevance measure between topic models, compensating for semantic weakness with an aim similar to that of [24]:
$$D(M_1 \,\|\, M_2) = \sum_{w} P(w \mid M_1) \log \frac{P(w \mid M_1)}{P(w \mid M_2)} \qquad (10)$$
$M_1$ and $M_2$ are the topic models for topics $T_1$ and $T_2$ based on RM. The two topic models both contain the word $w$. Equation (10) shows whether the two topic models have semantic similarity: when the value of $D$ is close to 0, the similarity of the two models is high. To enhance the robustness of the model, the Clarity probability is introduced for the case in which the two models have small dissimilarity but are both similar to the background corpus [25]. Such a case is not a valid topic and should therefore be treated as noise. Thus, equation (10) is extended to the clarity-adjusted form of equation (11). Equation (12), which is a conversion of equation (11), is used in the experiment for convenience of code design. Such a TD model design facilitates an implementation that achieves linear performance by combining full-text retrieval with the new algorithm, as shown in [16]. Other TD algorithms reported in the literature have non-linear performance.
The following experiments show lower error rates than those reported in [2].
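The relative-entropy measure between two topic models can be sketched in Python as follows. The first function is the standard Kullback-Leibler divergence. The second is one clarity adjustment used in relevance-model work, the KL divergence to the background model minus the KL divergence to the other topic model; whether this matches the clarity-adjusted equations of this paper is an assumption, and the function names and toy distributions are hypothetical.

```python
import math


def kl_divergence(m1, m2):
    """D(M1 || M2) = sum_w P(w|M1) * log(P(w|M1) / P(w|M2)).

    m1 and m2 map words to probabilities over a shared vocabulary;
    m2 is assumed to be smoothed, so it has no zero entries.
    """
    return sum(p * math.log(p / m2[w]) for w, p in m1.items() if p > 0)


def clarity_adjusted(m1, m2, background):
    """Assumed adjustment: KL(M1 || background) - KL(M1 || M2).

    This equals sum_w P(w|M1) * log(P(w|M2) / P(w|background)); higher
    values mean M2 is close to M1 and both differ from the background.
    """
    return sum(p * math.log(m2[w] / background[w]) for w, p in m1.items() if p > 0)


# Hypothetical toy models over a three-word vocabulary.
m1 = {"palace": 0.5, "antique": 0.3, "the": 0.2}
m3 = {"palace": 0.1, "antique": 0.1, "the": 0.8}   # close to the background
ge = {"palace": 0.05, "antique": 0.05, "the": 0.9}  # background model
```

In this toy setting `kl_divergence(m1, m1)` is zero, `kl_divergence(m1, m3)` is positive, and the adjusted score of `m1` against itself is higher than against the background-like model `m3`, which is the behaviour the noise filtering relies on.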

D. CCA Representation
Content-based multimedia retrieval is a challenging issue, as it aims to provide an effective and efficient tool for searching media objects. Almost all existing multimedia retrieval techniques focus on retrieval within a single modality, such as image retrieval [26,27], audio retrieval [28], video retrieval [29] and motion retrieval [30]. However, interactions that enhance students' engagement with Information and Communication Technology (ICT) are multimodal and include gesture, touch, language and so on. Because contents come in multiple modalities, an approach is needed that extends cross-media retrieval to a more generalized multimodal environment with less manual effort in collecting labeled sample data. In this article, multi-modality representation [31,32,33] is adopted, as it needs less manual effort in labeling multimedia documents already detected by the TD module. The significance of this subsection lies in inter-media correlation and in solving the problem of heterogeneous topics across different types of media.
The correlation between feature space $X$ and feature space $Y$ is defined as follows. $X_{(n \times p)}$ denotes $n$ samples and $p$ variables, and $Y_{(n \times q)}$ denotes $n$ samples and $q$ variables. To obtain the main features, based on their feature weightage, a combination of variables from $X$ and $Y$ is extracted:
$$R = X W_x, \qquad S = Y W_y \qquad (13)$$
Here $W_x$ and $W_y$ are subspace feature vectors. They are supposed to reduce the number of variables and use the distribution of $R$ and $S$ to imitate that of $X$ and $Y$. PSL uses the relevance coefficient in (14), which is optimized by (15):
$$\rho = \frac{W_x^{T} C_{xy} W_y}{\sqrt{W_x^{T} C_{xx} W_x \; W_y^{T} C_{yy} W_y}} \qquad (14)$$
$$\max_{W_x, W_y} \; W_x^{T} C_{xy} W_y \quad \text{subject to} \quad W_x^{T} C_{xx} W_x = 1, \; W_y^{T} C_{yy} W_y = 1 \qquad (15)$$
$C_{xy}$ is the covariance matrix of $X$ and $Y$. Then, with the Lagrange multiplier method, $C_{xy} C_{yy}^{-1} C_{yx} W_x = \rho^2 C_{xx} W_x$ is computed, which is a generalized eigenproblem of the form $A x = \lambda B x$, and the sequence of $W_x$'s and $W_y$'s can be obtained by solving for the generalized eigenvectors. Based on (13), the minimum is computed to find the correlation between $R$ and $S$. For example, let $X$ represent the visual feature vector of motion (Component 1 of PSL) and $Y$ represent the feature vector of speech (Component 2 of PSL). $X$ and $Y$ are each mapped into the subspace. Here, the subspace is the Multi-modality Laplacian Eigen-Maps Semantic Subspace (MLESS).
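The quantity that CCA maximizes can be made concrete with a small Python sketch. It only evaluates the correlation between the projections of two feature matrices for given candidate weight vectors; it does not solve the generalized eigenproblem, for which a linear algebra library would normally be used. The function names, the weight vectors and the toy matrices are hypothetical.

```python
import math


def project(samples, weights):
    """Project each sample (row) onto a weight vector, i.e. compute X w."""
    return [sum(x * w for x, w in zip(row, weights)) for row in samples]


def correlation(r, s):
    """Pearson correlation between two projected sequences.

    For centered data this equals w_x' C_xy w_y divided by
    sqrt(w_x' C_xx w_x * w_y' C_yy w_y).
    """
    n = len(r)
    mean_r, mean_s = sum(r) / n, sum(s) / n
    cov = sum((a - mean_r) * (b - mean_s) for a, b in zip(r, s))
    var_r = sum((a - mean_r) ** 2 for a in r)
    var_s = sum((b - mean_s) ** 2 for b in s)
    return cov / math.sqrt(var_r * var_s)


# Hypothetical toy data: 4 samples, 2 variables per modality.
X = [[1, 0], [2, 1], [3, 0], [4, 1]]   # e.g. motion features
Y = [[2, 5], [4, 1], [6, 3], [8, 2]]   # e.g. speech features

# Candidate weight vectors; CCA would search for the pair maximizing rho.
rho_good = correlation(project(X, [1, 0]), project(Y, [1, 0]))
rho_poor = correlation(project(X, [1, 0]), project(Y, [0, 1]))
```

Here the first variable of `Y` is an exact multiple of the first variable of `X`, so the first pair of weight vectors attains the maximum correlation, while the second pair yields a weaker, negative correlation.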
Because a large quantity of complex numbers arises, the coordinate values in each dimension of the subspace are converted to their polar form for the mapped motion features. The same conversion is done for the mapped speech features. The semantic distance between motion and speech is then computed from the converted coordinates. PSL chooses the closest subject coupled with rich media contents and then provides recommendations for students and teachers.
Topics are generally clusters of events and contents of specific subjects. To be personalized, clusters need to evolve as students and teachers learn more, and the clusters can also be optimized through feedback based on each user's experience, opinions, interests and creativity. In this way, personalized education material is finally achieved. In each evolution, the students or teachers have a chance to provide feedback on the recommended material, and the feedback is treated as guidance for the next TD and CCA tasks.

IV. EXPERIMENTS
A Java-based personalized education system [43] has been implemented. This system can be easily deployed on any Java virtual machine (JVM) platform.

A. Topic Detection
As a testbed, the system gathered news reports from the standard NIST TDT3 testbed [2]. In addition, fresh rich media documents from Xinhua News Agency were added. The experiments tested the viability of our work in the context of real-time fresh online and offline contents of NIST. The detection rate is assessed by means of the link detection task (LDT), as stated in [16].

B. Relevance Feedback of CCA
By adopting the CCA approach co-researched with the AI Lab of Zhejiang University, the experimental results of the relevance feedback of CCA [33] fully utilize the contents relevant to the detected topics or subjects, in the context of the user's opinions, creativity, personal knowledge and interests.

C. Practical Deployment
The practical deployment of our algorithm is a patented system, in both English and Chinese, for personalized education, serving as the smart education module of a Smart City project, as shown in Figure 4. Children's interactions with the computer were frequently referred to, by adult teachers and children, as "playing with the computer" in the same way as they would talk about playing with the bricks or the model animals. The personalized subjects are presented to the children as shown in the lower portion of Figure 5. This is not surprising inasmuch as the dominant ethos of personalized environments is that children learn through play in a game-like format [34]: "Children's encounters with books, crayons, and paints were not referred to as play activities, probably because their role in the curriculum was easily identified and practitioners were used to recording children's development in the areas of reading, writing, and drawing. Children's freedom to choose resulted in highly varied patterns of engagement". In line with these observations, the three categories of teacher involvement in PSL's computer play are reactive supervision, guided interaction, and a hybrid approach that combines elements of both.
The application of PSL research investigated learning in personalized settings, and an adapted version of the framework and the fundamental technology breakthrough have the potential to become research tools and to support changes in practice for professionals in other sectors of education. For example, it is by no means a novel observation that families play a key role in supporting children's learning.
Published during the 1960s, the influential Plowden Report [35] has a section on the importance of parental attitudes and the "physical amenities" at home. It is recognized that children acquire almost as much general knowledge at home as in school, and almost as much information about the world. Parents can play the role of teachers in PSL, since there has been a clear extension of education from formal settings to the home, with more parental engagement [34].

V. CONCLUSIONS AND FURTHER WORKS
Owing to its efficiency and effectiveness, this work meets practical demands in the fields of Community Question Answering (CQA) [36], social link management [37,38], learning for personal environments or R&D activities [39,40,41], and preschool cognitive growth; accordingly, a patent has been granted [17].
Guided interaction helps practitioners to question the purpose of information and communication technology (ICT) and to articulate, reflect on and legitimize changes in pedagogy. PSL prompts changes in the provision of resources, planning and assessment. Practitioners become more innovative, expand their definition of ICT, use existing resources in different ways, and begin to plan for, observe and record students' engagement with ICT in new ways. The contribution of PSL in this paper lies not only in fast TD-based clustering but also in CCA-based, measurable rich media topic recommendation for subject learning with persistence, engagement and pleasure. Personalized subject learning is becoming the trend for people to learn fresh contents. This research shows the capacity and efficiency to deal automatically with vast amounts of information and contents. Hence, the PSL system shows clear applicability and availability.
In the course of this work, a number of interesting questions have been encountered that we hope to answer in future research. Besides handling multimedia contents, the PSL system is able to process multilingual contents in one shot. The research team is currently working on 52 other languages besides English and Chinese. An international PSL system across countries should cater for such a need in the future. In-depth collaboration is planned with teams in the States, Europe and Singapore that are keen on PSL, with the aim of forming an international personalized education alliance along with this endeavor.
Components 1-4 capture various sorts of input from students, including motion, speech, drawing and text input. Component 7 performs the efficient topic detection task, and contents are fed in from various sources, e.g. Component 11 (offline contents such as a digital library), Component 12 (contents edited by teachers, Component 17) and Component 13 (online contents). Learning from books alone is no longer the way to keep pace with the times; fresh online contents are a complement to printed educational media. Component 9 records the learning behaviour of students and stores it in a behavioural log. Contents that deserve to be education materials are collected by Component 10. Component 6 analyses the correlations among the collected multimedia contents and recommends personalized subjects and contents to students and teachers. Component 8 is responsible for relevance feedback based on the likes and dislikes of students and teachers, as a means to justify and improve the effectiveness of PSL. The inputs and outputs of the PSL system follow the sequence of Figure 1. To further elaborate the figure, real examples are shown in Figure 2 and Figure 3. Topics are detected and clustered as indicated by the red arrows in Figure 2, and then CCA decides which topic suits the personal needs of students, as shown in Figure 3. In Figure 2, topics with "Old Summer Palace" have been detected and related contents are clustered to feed into the PSL system.

Fig. 3. CCA Correlates Antiques Returned to Old Summer Palace and Films by Jackie Chan
Contents are recommended by inter-media correlation measure and relevance feedback within the detected topics. The two methods work together and complement each other for comprehensive, personalized and subject-based interactions. Students and teachers then have easy access to vast amounts of personalized education contents at any time.
Using feature items and their weights, a document can be expressed as a vector in an n-dimensional vector space. The expression $D = D(w_1, w_2, \ldots, w_n)$ is called the Vector Space Model of $D$. The classic weight calculation method in statistical methods is TF-IDF. There are many ways to evaluate the significance of a term, ranging from simply identifying its existence to evaluating its distribution level in a document or in a whole corpus. The most common term weighting scheme for processing index terms is TF-IDF, which stands for term frequency-inverse document frequency [21]. TF-IDF uses the term frequency and inverse document frequency of each feature item to calculate the weight. $tf_{ik}$ (term frequency) represents the number of occurrences of $t_k$ in document $D_i$; it is a local statistic that takes different values in different documents. $idf_k$ is a global statistic reflecting a given term's distribution over the whole data set. The original definition of IDF is given in equation (3).

Fig. 4. Personalized multimodal subjects are shown for the students and teachers in the implemented PSL system (smart education module of Smart City project)
1) A topic is defined as $T_j = (f_{j1}, f_{j2}, \ldots, f_{jn})$, which represents the features of topic $T$;
2) A follow-up story is defined as $d_i = (f_{i1}, f_{i2}, \ldots, f_{im})$, which represents the features of news story $d$;
3) $S(d, T)$ is the similarity of $T$ and $d$. $w$ is a feature item of $T$ and $d$. $tf(w, d)$ is the frequency of $w$ in $d$. $L_d$ is the total number of terms in $d$. $\lambda$ is a smoothing factor in (0, 1), tuned to make the system achieve minimum cost when tracking the TDT3 corpus. The TDT3 corpus was created by NIST specifically to accommodate Chinese news and stories. The smoothing technique is introduced to prevent data sparsity in unigram modeling.