A Semantic Approach for Mathematical Expression Retrieval

Math search, or mathematical expression retrieval, is a challenging task. Mathematical expressions are complex and highly symbolic, and they carry a semantic meaning that retrieval must respect. In this paper, we propose a similarity search method for mathematical expressions based on a multilevel representation of expressions and a multilevel search. We use the K-Nearest Neighbors algorithm with three types of distances to evaluate relevance between expressions. In our experiments, the proposed system significantly outperforms the statistical approach.
Keywords: Mathematical expression; Information retrieval; MathML; Semantic similarity


I. INTRODUCTION
The rapid expansion of information technology and the spread of digital libraries across domains make search engines necessary to help users share and retrieve information from the web or from digital libraries. Nowadays, search engines can search all kinds of documents, including text, image, audio and video. However, documents containing special data such as mathematical expressions, tables, diagrams and drawings cannot be retrieved by classical search engines.
Unlike text retrieval, mathematical expression retrieval, although studied for several years, is still at the research stage. There are few studies in the field of Mathematical Expression Retrieval (MER) and few search engines dedicated to it, such as MathFind [1], ActiveMath [2], Wolfram search, Wikipedia formula search and MathWeb search. Most of these search engines are based on text retrieval techniques.
Mathematical expressions are highly symbolic and have their own structure. For example [3]:
• The order of elements in a mathematical expression carries semantic meaning. For example, ∑ sin (exp( )) and ∑ exp (sin( )) are two completely different expressions that contain the same elements in a different order. A mathematical search must therefore respect the order of the elements to retrieve the right expression.
• Given two math expressions ( + ) and √( + ), the role of the sub-expression ( + ) differs in each. If the query asks for the square root of ( + ), the system must take this into account, and all relevant expressions √( + ) should be ranked highly.
• Mathematical equations can be written in different notations and still have the same semantic meaning; for example, ( + ) and ( + ) are the same expression.
Retrieving mathematical formulas under all these constraints requires a system based on semantic representations of the query expressions. Several common mathematical markup languages provide such representations: LaTeX [4], OpenMath [5], ASCII [6] and MathML [7].
MathML is an application of XML (Extensible Markup Language) for encoding the notational and semantic structure of mathematical expressions. It is currently used by many systems for retrieving math expressions on the web [8,9,3].
In this paper, we propose an algorithm that extracts feature vectors from mathematical expressions represented in MathML, and a multilevel search algorithm based on K-Nearest Neighbors. We use several distance measures, both to evaluate the effectiveness of our system and to find the measure that suits it best.

II. RELATED WORK
Retrieving mathematical equations has attracted much attention from researchers in the past decade, and several systems and methods for retrieving equations from the web or from digital libraries have been reported.
Yokoi et al. [10] proposed a new similarity search scheme for mathematical expressions. They introduced a similarity measure based on Subpath Set and proposed a MathML conversion suited to it. The method aims to return similar equations by measuring similarity with tree matching techniques and by reshaping the structure of content-based MathML. Based on their first experiments, they believe that their system has the potential to provide a flexible interface for searching mathematical expressions on the web.
190 | Page www.ijacsa.thesai.org
Tam T. Nguyen et al. [3] presented a lattice-based approach for mathematical search using Formal Concept Analysis (FCA), a powerful data analysis technique used for information retrieval [11]. The approach involves several phases. First, features are extracted from the MathML representation. These features are used to construct a mathematical concept lattice. In the query retrieval phase, the query expression is processed and inserted into the concept lattice, where it is matched against the math expression concepts to rank the relevant expressions. The results showed that the approach performs better than the conventional best match retrieval technique. Another important advantage of the lattice-based approach is its support for the visualization and navigation of search results via a dynamic graph.
In [12], S.-Q. Yang and X.-D. Tian developed a dedicated retrieval method. They proposed a maintenance algorithm for a mathematical expression index based on the Formula Description Structure (FDS), which includes index item searching, inserting and deleting operations. Moreover, S.-Q. Yang et al. designed a matching model for mathematical expressions based on the FDS index [13]. To achieve exact matching, the math retrieval attributes are embedded in an index that supports three query modes: global query mode, local query mode and operational query mode.
L. Gao et al. [14] proposed a semantic enrichment technique to retrieve mathematical formulae from web pages and PDF documents, with a novel query input interface that allows users to copy formula queries directly from PDF documents instead of writing them in a markup language. They used a novel indexing and matching scheme to search for similar mathematical expressions based on both textual and spatial similarities. The proposed system achieves better performance than two representative mathematics retrieval systems.
MathSearch [15] is a formula-based search engine for mathematical information on the internet. In this system, the Mathematical formula Query Language (MQL) [16] was designed for expressing and processing queries. MQL has two forms: a character string form (MQLS) and an XML form (MQLX). With MQLX and MQLS, MathSearch supports semantic queries, wildcard queries and combination queries.
WikiMirs [17] is a tool that facilitates mathematical formula retrieval in Wikipedia. The system works in several phases. In the first phase, it normalizes the LaTeX formulas of Wikipedia into a unified form. Terms are then extracted from the normalized presentation tree through a series of processors, such as a presentation tree parser, a normalizer and a term extractor; these terms reflect the features of the expression. The extracted terms are used to build an inverted index. In the last step, user queries, given as math expressions in LaTeX form, are processed with the same steps and the matching formulas are retrieved.

A. Processing of Mathematical expressions
Converting mathematical expressions into a uniform mathematical representation plays an important role in math search systems and digital libraries.

B. Extraction process
To find structural and semantic similarity between mathematical expressions in a large database or at web scale, we need a reduced and efficient representation that respects the structural and semantic specifics of mathematical formulas. We first use a simple algorithm that counts the number of occurrences of each operator and math function. As a second algorithm, we propose a multilevel representation of expressions.

1) Statistical algorithm
From the MathML code, this algorithm extracts the number of occurrences of each operator (+, −, *, /), of variables, of constants, and of functions ( , , , ...), and stores them in a vector (Fig. 1). Clearly, there is no semantic similarity between the two examples of Fig. 2. We therefore need an algorithm that is just as fast but more effective with respect to semantic similarity.
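A minimal sketch of such a count-based extractor is given here, assuming presentation MathML in which operators appear in `<mo>`, identifiers in `<mi>` and numbers in `<mn>`. The function list `FUNCTIONS` and the ordering of the vector are illustrative assumptions; the layout of the vector in Fig. 1 is not reproduced.

```python
import xml.etree.ElementTree as ET
from collections import Counter

OPERATORS = ["+", "-", "*", "/"]
FUNCTIONS = ["sin", "cos", "log"]  # hypothetical function list, for illustration only
ORDER = OPERATORS + ["variable", "constant"] + FUNCTIONS  # assumed vector layout


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://...}mi' -> 'mi'."""
    return tag.rsplit("}", 1)[-1]


def statistical_vector(mathml):
    """Count operators, variables, constants and functions in a MathML string."""
    counts = Counter()
    for node in ET.fromstring(mathml).iter():
        # MathML often encodes minus as U+2212; map it to the ASCII hyphen.
        text = (node.text or "").strip().replace("\u2212", "-")
        name = local_name(node.tag)
        if name == "mo" and text in OPERATORS:
            counts[text] += 1
        elif name == "mi":
            counts[text if text in FUNCTIONS else "variable"] += 1
        elif name == "mn":
            counts["constant"] += 1
    return [counts[key] for key in ORDER]


if __name__ == "__main__":
    # sin(x + 1) written in presentation MathML
    sample = ("<math><mi>sin</mi><mo>(</mo><mi>x</mi><mo>+</mo>"
              "<mn>1</mn><mo>)</mo></math>")
    print(statistical_vector(sample))  # [1, 0, 0, 0, 1, 1, 1, 0, 0]
```

Because the vector only holds counts, it is insensitive to order and nesting: under this sketch, sin(cos(x)) and cos(sin(x)) map to the same vector, which illustrates why counts alone cannot capture semantic similarity.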

2) Proposed method
In order to define different levels for each math formula, we need to convert all mathematical equations into MathML code.
The first level is built by finding all main operators (+, -, *, /) that link the brackets and functions (trigonometric, logarithmic and algebraic) of the expression. Every expression inside brackets and every function argument is identified and replaced by the term "exp". For example:
The value of each "exp" is stored for use in the second level.
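The reduction step can be illustrated with a small sketch. It is only an approximation of the idea: it works on plain infix strings rather than MathML, treats every parenthesised group as an "exp", and uses a hypothetical expression, log(sin(x)+1), that is not the example from the text. The names `reduce_level` and `build_levels` are illustrative.

```python
def reduce_level(expr):
    """Replace each top-level parenthesised group of expr by 'exp'.

    Returns the reduced string and the list of stored group contents,
    which become the input of the next level.
    """
    reduced, groups = [], []
    depth, start = 0, 0
    for i, ch in enumerate(expr):
        if ch == "(":
            if depth == 0:
                start = i + 1          # first character inside the group
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                groups.append(expr[start:i])
                reduced.append("(exp)")
        elif depth == 0:
            reduced.append(ch)         # keep everything outside brackets
    return "".join(reduced), groups


def build_levels(expr, max_levels=3):
    """Apply reduce_level repeatedly; level i holds the reduced sub-expressions."""
    levels, current = [], [expr]
    for _ in range(max_levels):
        if not current:
            break
        reduced_list, next_groups = [], []
        for sub in current:
            reduced, groups = reduce_level(sub)
            reduced_list.append(reduced)
            next_groups.extend(groups)
        levels.append(reduced_list)
        current = next_groups
    return levels


if __name__ == "__main__":
    # Note: the placeholder 'exp' would clash with an exponential function
    # named exp, so the sample avoids that function.
    print(build_levels("log(sin(x)+1)"))
    # [['log(exp)'], ['sin(exp)+1'], ['x']]
```

In this sketch, a count-based feature vector would then be computed on each reduced string, one level at a time.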
The vector of level 1 is the outcome of the statistical algorithm applied to the reduced expression. For the first-level example, this gives one √, one , and 3 exp, so the representative vector of level 1 becomes (3,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0). These features are extracted and stored in the vectors Vl1. In the second level, we treat each "exp" by repeating the procedure used in the first level, and we store the extracted features in the vectors Vl2. We continue applying this method until we obtain all levels and all vectors Vli (Vl1, Vl2, ..., Vln) for each equation. In practice we use only 3 levels.

RETRIEVING RESULTS
The proposed method is based on the structural and semantic multilevel similarity between mathematical equations. The similarity degree is obtained from all representation levels (Fig. 3a, 3b).
Once the expression levels and the vectors Vli are defined, we first retrieve all expressions whose vectors Vl1 are similar to that of the math query. We then move to the second level, considering only the equations already retrieved at the first level, and repeat the same procedure to retrieve the expressions that have similar vectors Vl2. We continue with the same procedure at level 3.
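The level-by-level filtering described above can be sketched as a cascade in which each level only re-ranks the candidates kept by the previous one. The sketch assumes one feature vector per level and per expression, ranks by Euclidean distance, and uses hypothetical cut-off sizes in `keep`; none of these choices is specified in the text.

```python
import math


def multilevel_search(query_levels, database, keep=(50, 20, 10)):
    """Cascade search over level vectors.

    query_levels: list of vectors, one per level, for the query.
    database: dict mapping an expression id to its list of level vectors
              (assumed to have the same number of levels and dimensions).
    keep: number of candidates kept after each level (hypothetical values).
    """
    candidates = list(database)
    for level, k in enumerate(keep):
        if level >= len(query_levels):
            break
        q = query_levels[level]
        # Rank the surviving candidates by distance at this level only.
        ranked = sorted(
            candidates,
            key=lambda name: math.dist(q, database[name][level]),
        )
        candidates = ranked[:k]
    return candidates


if __name__ == "__main__":
    db = {
        "a": [(1, 0), (0, 1)],
        "b": [(1, 0), (5, 5)],
        "c": [(9, 9), (0, 1)],
    }
    # "c" matches the query at level 2 but is discarded at level 1.
    print(multilevel_search([(1, 0), (0, 1)], db, keep=(2, 1)))  # ['a']
```

One consequence of this design is that an expression rejected at level 1 can never be recovered at a deeper level, so the first-level cut-off controls recall.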
The K-Nearest-Neighbor (KNN) algorithm is used in this phase to retrieve math equations. It is a non-parametric, lazy, instance-based learning algorithm [19] that relies on a distance function between pairs of observations. KNN uses this distance to measure the similarity between the query math equation and those of the database. For two feature vectors $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$, the distance is calculated using one of the following measures, all special cases of the Minkowski distance $d_p(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$:
• Euclidean distance, the special case p = 2: $d_2(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
• Manhattan distance, obtained for p = 1: $d_1(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
• Chebyshev distance, obtained for p = ∞: $d_\infty(x, y) = \max_{i} |x_i - y_i|$

To evaluate the effectiveness of the proposed system, we created a dataset of mathematical expressions. The dataset was built with MathType, an interactive tool for authoring mathematical material. In Microsoft Word or PowerPoint, a MathType ribbon tab facilitates creating, inserting and editing math equations. The dataset elements can easily be converted to MathML or LaTeX. The set contains 6925 mathematical expressions written with symbols from five languages: Latin, Arabic, Tifinagh [20,21], Hebrew and Japanese (Fig. 4 a, b, c, d, e). For each language, we wrote 1385 math expressions of different types, such as polynomial, algebraic, statistical, trigonometric and logarithmic.

In this subsection, we present the results of the proposed system using the Euclidean distance, the Minkowski distance and the Correlation distance. We compare our results to the statistical approach.
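As a small illustration of the Minkowski family of distances named for the KNN step, the following sketch computes the distance for a given p and returns the k nearest database vectors. The function names and the toy vectors are assumptions made for the example, not part of the described system.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p; p = float('inf') gives Chebyshev."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == float("inf"):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


def k_nearest(query, vectors, k, p=2):
    """Return the ids of the k vectors closest to query under order p."""
    return sorted(vectors, key=lambda name: minkowski(query, vectors[name], p))[:k]


if __name__ == "__main__":
    x, y = (0, 0), (3, 4)
    print(minkowski(x, y, 1))             # Manhattan: 7.0
    print(minkowski(x, y, 2))             # Euclidean: 5.0
    print(minkowski(x, y, float("inf")))  # Chebyshev: 4

    toy = {"a": (0, 0), "b": (3, 4), "c": (1, 1)}
    print(k_nearest((0, 0), toy, k=2))    # ['a', 'c']
```

Because KNN is a lazy method, there is no training step here: all the work happens at query time, when the query vector is compared with every stored vector.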
We found it difficult to evaluate the results with recall and precision computed from a similarity measure, as proposed by T. T. Nguyen et al. [3]. In their paper, the similarity measure between two expressions E1 and E2, represented by their attribute sets M(E1) and M(E2), is:

We decided instead to give a score of three points for identical similarity, two points for sub-expression similarity and categorical similarity, and zero for non-relevant expressions. Table I shows the performance of the proposed system using the Euclidean distance, the Minkowski distance and the Correlation distance, compared to the statistical approach. The score is based on the top 10 results of 10 test queries. A perfect score is 120 points. The experiments show that the results obtained with the proposed system outperform the statistical approach. With the Minkowski distance, our system becomes more effective (a score of 94%).

Tables II and III show test queries with their relevant expressions in the dataset, for both the proposed system with the Minkowski distance and the statistical approach. Our system with the Minkowski distance allows a better detection of categorical similarity. We notice that most of the relevant expressions returned by the proposed system exactly match the query. For the statistical approach, the absence of semantic input can produce wrong outputs, such as treating sin 2 as relevant to 2, and √ + 1 as relevant to + 1 + √2.

VI. CONCLUSION
In this paper, we proposed a semantic approach to retrieving mathematical expressions. Based on MathML, we extract a multilevel representation of each mathematical expression. We use KNN with different types of distances to measure the similarity at each representation level. Our experiments show encouraging results: the system significantly outperforms the statistical approach, and the Minkowski distance allows a better detection of categorical similarity. In future work we have two goals: first, to take more types of math expressions into account, and second, to evaluate our system at web scale.