An Intermediate Representation-based Approach for Query Translation using a Syntax-Directed Method

Abstract—In this work, we aim to make a single query sufficient to extract data regardless of the underlying data model. In this way, users can freely query a heterogeneous database with any query language they master, not necessarily the query language associated with the model, thus overcoming the need to deal with multiple query languages, which is usually an unwelcome burden for non-expert users and even for expert ones. To that end, we propose a new translation approach that relies on an intermediate query language to convert the user query into a query language suited to the nature of the data interrogated. This is more beneficial than repeating the whole translation process for each new query submission. It also makes the system modular, divided into multiple, more flexible, and less complicated components, which increases the possibility of performing independent transformations and of switching between several query languages efficiently. With our system, querying each data model with its corresponding query language is no longer bothersome. As a start, we cover the eXtensible Markup Language (XML) and relational data models, whether native or hybrid. Users can retrieve data sources over these models using just one query, expressed with either the XML Path Language (XPath) or the Structured Query Language (SQL).

Keywords—Data Model; Relational Database; eXtensible Markup Language (XML); translation; model integration; intermediate representation; ANTLR (ANother Tool for Language Recognition)


I. INTRODUCTION
The relational database has long been the most widely used data model for storing and managing data in organizations. Likewise, XML (eXtensible Markup Language) is increasingly used as a universal solution to exchange data over the internet. Consequently, many projects and studies have been interested in integrating the two and finding means to interrogate both kinds of data. Some researchers focused on storing and querying XML data using a relational database system [1] [2]. Others attempted to create general systems that manage XML among other data formats [3]. The approaches mentioned above do have considerable advantages, but they also come with limitations to some degree [4].
Nevertheless, by exploring other directions for querying heterogeneous databases, especially those based on query translation, we identified aspects closely related to our intentions. Accordingly, adopting a translation tool can efficiently meet our aim, and a syntax-directed approach is a sound solution. To strengthen the process, we generate an intermediate query language that reflects the logical interpretation of the query. We call it the universal query language (UQL): a transitional phase that provides an intermediate representation to switch between steps accurately, instead of converting the source query language directly into the target query language. The system is capable of performing queries against XML and relational databases, as well as hybrid ones.
Henceforth, there is no need to be familiar with many query languages to access data from different data models, nor to express queries in precisely the query language that corresponds to the data model used to structure that data. One query, expressed with either the Structured Query Language (SQL) or the XML Path Language (XPath), is enough [5].
We rely on the syntax-directed translation method, in which the parser drives the translation of the source query language. Semantic analysis and interpretation are therefore performed based on the syntactic structure. For the hands-on part of building language processing tools, handwriting the parser may work, but it is clearly not the best approach in complex cases. Alternatively, using a powerful parser generator saves time, effort, and resources, as it automates important phases of the process. For that reason, we use ANTLR (ANother Tool for Language Recognition) to implement the parser. It takes as input a grammar that specifies a language and generates as output the source code of a recognizer for that language. A language is specified using a Context-Free Grammar (CFG), expressed in Extended Backus-Naur Form (EBNF).
The paper is organized as follows: this introduction presents the general context of the project. Section 2 brings in some preliminaries and terminology. Section 3 presents our objectives and summarizes the mechanisms of the overall system and the translation process. Section 4 explores the language recognition and processing phase. Section 5 presents the intermediate representation phase. Section 6 discusses the data extraction phase and the nature of the databases under study that the system can handle. Finally, Section 7 concludes.

II. PRELIMINARIES AND TERMINOLOGY

A. Describing a Language using a Grammar
Regular expressions are quite useful but leave little to no room for extension. Not all patterns can be described with regular expressions; the most obvious limitation is the lack of recursion, and statements can quickly become messy and hard to maintain [6] [7]. Thus, regular expressions are not enough. Instead, CFGs, the type-2 grammars of the formal grammar hierarchy known as the Chomsky hierarchy [8], are a better fit for defining the syntax of a language.
Formally, a CFG [9] is a 4-tuple (N, ∑, S, P) where:
• N is a finite set of variables called nonterminals;
• ∑ is a finite set of terminals;
• S ∈ N is the start nonterminal (the axiom);
• P is a finite set of productions (rewrite rules).
Each production has the form A → α, where the head A is a single nonterminal and the body α ∈ (N ∪ ∑)* is a sequence of terminals and nonterminals.
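As a small illustration (not taken from the figures of this paper), consider the following grammar for a highly simplified SELECT statement:

  N = {Query, Cols, Cond}
  ∑ = {select, from, where, and, id, ',', '(', ')', '='}
  S = Query
  P:
    Query → select Cols from id where Cond
    Cols  → id | id , Cols
    Cond  → id = id | ( Cond and Cond )

The recursive Cond rule allows arbitrarily nested parenthesized conditions, which is exactly the kind of pattern that regular expressions cannot describe.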
We use the CFG to replace nonterminals with strings of nonterminals and terminals. The language of a grammar is the set of strings it generates. A grammar tells us the valid ways to put together a piece of code in a given language and helps us recognize and identify its typical structures quickly.

B. Grammar Notation
There are many ways to describe a grammar, but we are using EBNF [10]. It is an extended version of BNF (Backus-Naur Form), an unambiguous, formal, and mathematical way to specify CFGs. It is more concise and widely used as a formalism to describe the grammar of a formal language with a precise structure. It can be considered a metalanguage, as it is a powerful way to define other languages. An EBNF grammar of a language consists of a set of terminal symbols and a set of productions for nonterminals, which show how terminal symbols are combined into proper sequences.

C. Another Tool for Language Recognition (ANTLR)
It is possible to handwrite a parser from scratch, but this process can be complex, error-prone, and hard to change. Instead, there are many parser generators, like Bison and Yacc [11], that take a grammar expressed in a domain-specific way and generate code to parse that language. We are using ANTLR [12] [13] [14], a parser generator that uses LL(*) parsing [15] [16]. It takes a grammar as input and generates parsers that can build and walk parse trees and produce abstract syntax trees that can be further processed with tree parsers. According to antlr.org, ANTLR is a powerful parser generator for reading, processing, executing, or translating structured text or binary files. ANTLRWorks [17] is also a convenient ANTLR grammar development environment.
ANTLR is used in many real-world projects, for example:
• Hive and Pig use it to parse Hadoop queries;
• Twitter uses it to parse queries.

III. AIMS AND MECHANISMS
A database is a set of information stored by a tool according to a data model, that is, a defined structure. To extract and manipulate this information, we need a query language. Sources can be stored according to any model, which means that they can be heterogeneous. Now, let us assume that we face one of these scenarios: (1) a user has some data sources in XML and knows only SQL; (2) a user masters XPath and wants to access relational data. The common point between these two use cases is that the user's query language does not match the interrogated data model. Retrieving the data is not easy because the appropriate query language is needed: XPath in the first case and SQL in the second. Besides, most of the time, users cannot master all of these query languages at once; each query language has its own specification and can be challenging to learn. This is where our proposed system comes in. To overcome the dependencies between the data model and the query language, we developed a system that extracts data regardless of the nature of the model used (XML or relational), using one single query posed freely in either of the supported query languages (SQL or XPath), as explained in Fig. 1.

As shown in Fig. 3, it all begins with the user, who is free to choose between two query languages, SQL or XPath, to express the query and submit it to the reader. The latter provides a uniform interface between users and the system and reads their queries. At the outset, we are dealing with characters, but we want an abstract syntax tree that enables us to perform further analysis. That is where the language recognizer phase (Section 4) takes action. It consists of two parsers, one to parse SQL queries and the other to parse XPath queries; for that, we developed a lexer grammar and a parser grammar for each query language. At the end of this stage, the output is a parse tree that is fed to the analyzer, where the abstract syntax tree is built and processed. Then only the relevant information is selected to build our universal query language in the UQL builder phase, using the mapping rules module to map each part of the query to a suitable part of the UQL. After that comes the role of the translator, which translates it into the target language. Finally, the converted query is executed in the data extraction phase; a sketch of this pipeline is given below.
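The following Java sketch illustrates one possible way to organize these stages as independent components. All names (Recognizer, UqlBuilder, Translator, Extractor, TranslationPipeline) are hypothetical; they only mirror the phases of Fig. 3, not the actual implementation.

```java
// Hypothetical, minimal sketch of the translation pipeline described above
// (reader -> recognizer -> analyzer/UQL builder -> translator -> extractor).
interface Recognizer { Object parse(String query); }         // Section 4: lexer + parser -> parse tree
interface UqlBuilder { String build(Object parseTree); }      // Section 5: parse tree -> UQL document
interface Translator { String translate(String uql); }        // UQL -> target query (SQL or XPath)
interface Extractor  { String execute(String targetQuery); }  // Section 6: run against the data source

final class TranslationPipeline {
    private final Recognizer recognizer;
    private final UqlBuilder builder;
    private final Translator translator;
    private final Extractor extractor;

    TranslationPipeline(Recognizer r, UqlBuilder b, Translator t, Extractor e) {
        this.recognizer = r; this.builder = b; this.translator = t; this.extractor = e;
    }

    /** Runs one user query through the whole chain and returns the formatted result. */
    String run(String userQuery) {
        Object parseTree   = recognizer.parse(userQuery);
        String uql         = builder.build(parseTree);
        String targetQuery = translator.translate(uql);
        return extractor.execute(targetQuery);
    }
}
```

Keeping each stage behind its own interface is what makes the transformations independent and allows new query languages or data models to be plugged in without touching the rest of the chain.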

IV. LANGUAGE RECOGNITION AND PROCESSING
The previous section briefly presented the overall translation principle and the phases followed to convert the source query into the target query. This section describes the steps of the language recognition and processing phase.
ANTLR admits three variants of grammar specifications: lexers, parsers, and tree walkers (tree parsers), as shown in Fig. 4. All of them are alike, and the generated files behave in the same way, because ANTLR uses LL(k) analysis for all of them.
The lexer reads the input character by character and translates it into a sequence of syntactic units called tokens. These are then fed to the parser, which takes the token stream and produces a parse tree according to the grammar rules. Afterward, the tree walker processes the parse tree that was produced, as sketched below.
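As an illustration, the following snippet shows how this lexer/parser/walker chain is typically driven with the ANTLR (version 4) Java runtime. The generated classes SqlLexer, SqlParser, and SqlBaseListener, as well as the entry rule select_statement, are assumptions standing in for the grammars developed in this work; the runtime calls themselves (CharStreams, CommonTokenStream, ParseTreeWalker) are standard ANTLR APIs.

```java
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

public class RecognizerDemo {
    public static void main(String[] args) {
        // 1. Lexical analysis: characters -> tokens.
        CharStream input = CharStreams.fromString(
                "select first_name, last_name from Employee where id = 1;");
        SqlLexer lexer = new SqlLexer(input);                 // hypothetical generated lexer
        CommonTokenStream tokens = new CommonTokenStream(lexer);

        // 2. Syntactic analysis: tokens -> parse tree.
        SqlParser parser = new SqlParser(tokens);             // hypothetical generated parser
        ParseTree tree = parser.select_statement();           // entry rule of the SQL grammar

        // 3. Tree walking: visit every node with a listener.
        ParseTreeWalker.DEFAULT.walk(new SqlBaseListener() {  // hypothetical generated base listener
            // Override enter/exit methods here to collect the parts needed for the UQL.
        }, tree);
    }
}
```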

A. Query Language Specifications
The first step is the grammar. Because we are covering XML and relational database models, we need to define the grammar of their query languages, namely XPath and SQL.
SQL is a powerful query language for managing and manipulating data and covers almost every aspect of interacting with it. However, as the objective here is interrogating data, we focus specifically on the SELECT command, whose syntax is given in Fig. 5. Similarly, we focus on the most important construct of XPath, the location path; Fig. 6 illustrates its EBNF notation.
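To make the correspondence concrete, the same information need can be expressed in both languages. The XPath expression below assumes a hypothetical document in which Employee elements carry id, first_name, and last_name children; the actual documents handled by the system may be organized differently.

  SQL   : SELECT first_name, last_name FROM Employee WHERE id = 1;
  XPath : //Employee[id = 1]/first_name | //Employee[id = 1]/last_name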

B. Diagrammatic Form
ANTLR uses a simple EBNF-like syntax to define the grammar. For example, the syntax of a column in a select clause can be written as shown in Fig. 7 and is presented in Fig. 8 as a railroad diagram.
Lexer rules start with an uppercase letter, and parser rules start with a lowercase letter. Each rule has one or more patterns that it matches.
The suffix ? marks an optional element: K_AS? matches zero or one occurrence of K_AS. The symbol | separates alternative patterns of a rule, as in the illustrative sketch below.
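For instance, a rule in this style could resemble the following sketch; it is only illustrative, and the exact rules shown in Fig. 7 may differ.

```
// Illustrative ANTLR-style rules; the actual rules of Fig. 7 may differ.
K_AS         : 'AS' ;                        // lexer rule: name starts with an uppercase letter
column_alias : IDENTIFIER ;                  // parser rule: name starts with a lowercase letter
column
    : expression ( K_AS? column_alias )?     // K_AS? matches zero or one occurrence of K_AS
    | '*'                                    // | introduces an alternative pattern
    ;
```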

C. Parsing Queries
Parsing has the following phases: lexical analysis, syntactic analysis, and semantic analysis. In the lexical analysis (tokenization), the lexer splits the user query into tokens and defines precisely how these tokens are recognized; it reads a character stream as input and generates a token stream as output. Some tokens, like whitespace, can be discarded and are ignored during parsing. For instance, the query select first_name from Employee is turned into the tokens select, first_name, from, and Employee, with the spaces between them dropped. In the syntactic analysis, the parser figures out the relationships between the tokens produced by the lexer and generates a parse tree, a data structure that reflects the syntactic structure of the input query. In the semantic analysis, the parse tree is checked for invalid semantics.

The next section explores the process of generating the intermediate query language after parsing the source query, an in-between phase that helps generate the target query quickly. Section 6 provides further details on how the extraction works.

V. INTERMEDIATE REPRESENTATION
The process of building the UQL starts from the output of the previous phase, the language recognizer. Further steps are then needed to keep only the brief, relevant details required to generate the UQL efficiently. Moreover, the parser generator builds a Concrete Syntax Tree (CST), not an Abstract Syntax Tree (AST). The CST reflects exactly the form of the grammar, with every detail described in the syntax; it is essentially another representation of the grammar. That makes it easy to create but difficult to analyze and to interpret further. The AST, in contrast, contains only the mandatory elements and discards irrelevant details and extra information; it is clearer, more compact, and easier to process than a parse tree, while remaining an almost direct translation of the grammar. We therefore derive the abstract syntax from the concrete syntax.

We take the same example as in [5]: the query SELECT first_name, last_name FROM Employee WHERE id = 1. The parse tree produced by the language recognizer for this query is the following:

  (select_statement
    (select_core
      (selectClause select
        (list_columns
          (column (expression (column_name (any_name first_name))))
          ,
          (column (expression (column_name (any_name last_name))))))
      (fromClause from
        (list_tables (table_or_subquery (table_name (any_name Employee)))))
      (whereClause where
        (list_conditions
          (expression
            (expression (column_name (any_name id)))
            (comp_operator =)
            (expression (literal_value 1)))))
      ;))

After applying the unification principle illustrated in Fig. 12, along with the mapping rules, we generate the corresponding UQL.

Just one more step completes this phase: XML document validation. A well-formed XML document is one that conforms to the syntactic rules of the XML language. When an XML document has an associated DTD (Document Type Definition) or XSD (XML Schema Definition) and respects it, it is said to be valid. Validation is a way to verify that the document conforms to a grammar. We use the XML Schema depicted in Fig. 13 to describe the structure of our XML document. A sketch of how this validation can be performed is given below.
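The following is a minimal sketch of this validation step using the standard Java XML APIs. The file names uql.xsd and query-uql.xml are hypothetical placeholders for the schema of Fig. 13 and a generated UQL document.

```java
import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

public class UqlValidation {
    public static void main(String[] args) throws Exception {
        // Load the XML Schema (XSD) that describes the structure of UQL documents.
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new File("uql.xsd"));              // hypothetical schema file

        // Validate a generated UQL document against that schema.
        Validator validator = schema.newValidator();
        try {
            validator.validate(new StreamSource(new File("query-uql.xml"))); // hypothetical UQL file
            System.out.println("The UQL document is valid.");
        } catch (SAXException e) {
            System.out.println("The UQL document is not valid: " + e.getMessage());
        }
    }
}
```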

VI. DATA EXTRACTION
The system can access data from heterogeneous data models, namely relational and XML. The relational database has been the popular option to store and manage data since 1970 [24], and it is still the most widely used data model in organizations and powerful database systems [25]. Likewise, XML is widely used as a standard to exchange data over the internet; the native XML database tends to be a practical solution for variable data [26] and provides full support for XML query languages such as XPath or XQuery [27]. The system can also access data from a hybrid database, as major relational database management systems (DBMSs) offer hybrid engines that fit XML into a relational database environment [3], for instance Oracle [28] [29] [30], IBM [31], and Microsoft. Furthermore, SQL/XML, the XML extension to SQL, is making good advancements [32] [33].
As shown in Fig. 14, after executing the query against the suitable database, we perform further transformations to determine what is to be done with the data and how to go about it. Lastly, the answer is formatted according to the user's preference, if one is indicated. Otherwise, we apply the obvious choice: a tree form for XML sources and a tabular form for relational and hybrid databases. The tabular layout is the default format. A sketch of both extraction branches follows.
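The snippet below is a minimal sketch of this extraction step, assuming a hypothetical JDBC URL for the relational branch and a hypothetical employees.xml document for the XML branch; the translated queries shown are simply the running example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class ExtractionDemo {
    public static void main(String[] args) throws Exception {
        // Relational branch: run the translated SQL query over JDBC (URL and credentials are hypothetical).
        try (Connection con = DriverManager.getConnection("jdbc:postgresql://localhost/hr", "user", "pwd");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT first_name, last_name FROM Employee WHERE id = 1")) {
            while (rs.next()) {
                System.out.println(rs.getString("first_name") + " " + rs.getString("last_name"));
            }
        }

        // XML branch: run the translated XPath query over a native XML document (file name is hypothetical).
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse("employees.xml");
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList names = (NodeList) xpath.evaluate(
                "//Employee[id = 1]/first_name | //Employee[id = 1]/last_name",
                doc, XPathConstants.NODESET);
        for (int i = 0; i < names.getLength(); i++) {
            System.out.println(names.item(i).getTextContent());
        }
    }
}
```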

VII. CONCLUSION AND OUTLOOK
This paper presents our intermediate representation-based approach for translating queries, relying on the syntax-directed translation technique, to access data from heterogeneous sources and get the most out of each technology. The intermediate transition helps empower the system, especially in terms of independence, so that matching the data model with its corresponding query language is no longer bothersome or a burden. Herein, we covered the XML and relational data models, whether native or hybrid, and we hope to incorporate other models in future contributions.