A Novel Multidimensional Reference Model For Heterogeneous Textual Datasets Using Context, Semantic And Syntactic Clues

With the advance of technology and the use of modern devices, voluminous data are being produced. Of these data, 80% are unstructured and the remaining 20% are structured or semi-structured. The data are produced in heterogeneous formats that follow no standard. Among heterogeneous (structured, semi-structured and unstructured) data, textual data are nowadays used by industries for prediction and visualization of future challenges. Extracting useful information from such data is challenging for stakeholders because of the difficulty of lexical and semantic matching. A few studies have addressed this issue using ontologies and semantic tools, but their main limitation is low coverage of multidimensional terms. To solve this problem, this study proposes a novel multidimensional reference model (MRM) for heterogeneous textual datasets based on linguistic categories. The categories considered are context, semantic and syntactic clues, together with their scores. The main contribution of MRM is that it checks each token against each term based on an index of linguistic categories such as synonym, antonym, formal, lexical, word order and co-occurrence. The experiments show that MRM achieves a higher matching percentage than the state-of-the-art single dimensional reference model in terms of coverage, linguistic categories and heterogeneous datasets.


I. INTRODUCTION
"Big Data" refers to data sets whose size is beyond the ability of commonly used software tools to capture, curate, manage, and process within a tolerable elapsed time. Various industries with heterogeneous data face problems in storing, managing, retrieving, and analyzing large amounts of data. Big Data plays an important role in retrieving useful information from large datasets with the help of advanced tools and algorithms [1]. Nowadays, data are produced in structured, semi-structured and unstructured formats by a wide variety of resources and applications, and cannot be processed with simple tools [2].
In general, Big Data can be described by three V's: Volume, Velocity and Variety [3]. Other characteristics of Big Data described in [4] are volume, variety, velocity, veracity, valence, and value. Later, [5] presented 10 V's: volume, variety, velocity, veracity, variability, viscosity, volatility, viability, validity, and value.
Variety in Big Data refers to the heterogeneous types of data produced, which are further classified into three types, namely Structured, Semi-structured and Unstructured (SSU) [6], [7]. Structured data are organized in a predefined format and stored in tabular form. Semi-structured data cannot be queried directly because they lack a proper structure that conforms to a data model. Unstructured data are heterogeneous and variable in nature, such as text, audio, video, and images. Heterogeneous data cannot be processed with simple tools and techniques, which creates the problems of heterogeneity and similarity matching [2]; as a result, decision makers cannot make decisions based on scattered data.
With the advance of technology, computers are nowadays used to retrieve linguistic information from textual data, a field known as Computational Linguistics (CL) [8]-[9]. CL is classified into many categories, among which context clue, semantic, and syntactic matching [9]-[11] are widely used in the domain of linguistics. CL helps in identifying related words in input datasets and matching them against a data dictionary, which is known as domain knowledge [12].
Domain knowledge, also known as a reference model (RM), has been used in the fields of NLP and semantic-lexical matching. Vasilieous et al. [13]-[15] proposed a single dimensional reference model (SRM) for medical data quality of textual datasets. The SRM matches only one token to one term at a time and was developed for structured datasets, whereas the same patient's data can be represented by other terms and in other formats (semi-structured or unstructured).
Therefore, this paper proposes a multidimensional reference model (MRM) that supports matching one token to many terms and handles heterogeneous datasets.
The concept of multidimensional reference was adopted from [16]-[17], in which different schemas for one-to-one and one-to-many queries for NoSQL injection were proposed, as well as in [20]-[21].
www.ijacsa.thesai.org
The aim of this study is to answer the research question: how can a context, semantic and syntactic based reference model be built for greater data inclusivity? This is pursued through the research objective of developing a multidimensional reference model (MRM) based on context, semantic and syntactic bags of words for better data inclusivity. The significance of this research is to measure the inclusivity of semantic, context and syntactic words in MRM.
To further explain the multidimensional reference model for heterogeneous textual datasets, this paper is organized as follows: Sections II and III describe the related work and the methodology adopted for creating the MRM and the experiments conducted on heterogeneous datasets, Section IV presents the results for heterogeneous datasets, Section V discusses the results, and Sections VI and VII present the conclusion and future work, respectively.

II. RELATED WORK
Ordinarily, a reference model works as a procedure that contains the domain knowledge and relevant indexing of a topic or information of interest. It serves as a common template for structured data and contains a set of parameters that are important for generating the domain knowledge [14].
The proposed multidimensional reference model is shown in Fig. 8. It comes with extra features to handle heterogeneous datasets. It uses a generalized natural language concept and domain knowledge, which help in selecting the appropriate multidimensional domain data for the input datasets. Multidimensional indexing is an additional technique that classifies linguistic words into context, semantic, and syntactic clues.
These three categories aim to assist in building the vocabulary and understanding the domain knowledge with respect to the meaning, structure and representation of words, as opposed to existing reference models, where the selection of terms is based solely on a one-to-one relationship (see Fig. 1). It can be observed from Fig. 1 that the relationship is one-to-one: for any given token, a corresponding matching term is retrieved from the reference model. This selection is based on threshold values to identify the best matching term in the corpus. The term with the highest value is selected as a candidate for data curation.
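The one-to-one selection described above can be illustrated with a minimal sketch. It assumes a plain string-similarity score from Python's standard `difflib` and a hypothetical threshold of 0.8; the function name `best_single_match` and the sample terms are illustrative and are not taken from the SRM implementation.

```python
from difflib import SequenceMatcher


def best_single_match(token, terms, threshold=0.8):
    """Return (term, score) for the best-scoring term, or None if below threshold."""
    best_term, best_score = None, 0.0
    for term in terms:
        # Similarity ratio in [0, 1] between the token and a reference term.
        score = SequenceMatcher(None, token.lower(), term.lower()).ratio()
        if score > best_score:
            best_term, best_score = term, score
    if best_score >= threshold:
        return best_term, best_score
    return None


# Hypothetical reference terms, for illustration only.
reference_terms = ["bank", "river", "finance"]
print(best_single_match("banks", reference_terms))
print(best_single_match("zebra", reference_terms))
```

Each token yields at most one term, which is the single-dimension behavior that the MRM is intended to extend.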
In contrast, the proposed multidimensional reference model utilizes a multifaceted concept, as depicted in Fig. 2. The MRM checks the relationship between a token and its related terms in multiple dimensions in order to identify the most appropriate term, as shown in Fig. 2 and Fig. 8 (Appendix A). As illustrated in Fig. 8 (Appendix A), a token from the input dataset is matched with its potentially related terms in several dimensions such as synonym [18], semantic, lexical [19], etc. For instance, the token "bank" could score high when matched with the term "boundary", meaning the edge of a river. However, if lexical matching of the same word is conducted, a financial institution or storage may be flagged. Therefore, it is very important to view one token in different dimensions. This will significantly increase the accuracy of term matching at different levels of data harmonization.
As mentioned earlier, the first type of semantic clue is formal semantics, which uses techniques from logic, philosophy, and mathematics to analyze the relationship between language and reality, truth, and possibility. The lists of words and their scores can be found in Appendices E and F.
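The "bank" example can be sketched as a lookup that returns candidates from every dimension instead of a single best term. The index layout, the scores, and the name `match_all_dimensions` are assumptions made for illustration; the actual MRM word lists and scores are those given in the appendices.

```python
# Hypothetical index: root word -> dimension -> {term: score}.
# Scores are made up for illustration.
mrm_index = {
    "bank": {
        "synonym": {"boundary": 0.9, "shore": 0.8},
        "lexical": {"financial institution": 0.85, "storage": 0.6},
    }
}


def match_all_dimensions(token, index):
    """Return every (dimension, term, score) candidate for a token, best first."""
    candidates = [
        (dimension, term, score)
        for dimension, terms in index.get(token, {}).items()
        for term, score in terms.items()
    ]
    return sorted(candidates, key=lambda c: c[2], reverse=True)


for dimension, term, score in match_all_dimensions("bank", mrm_index):
    print(dimension, term, score)
```

Keeping the dimension alongside each candidate lets a later step decide which sense of the token fits the input, rather than discarding all but one term up front.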
The third and last category of MRM is the syntactic clue, which focuses on word order and co-occurrences. Word order and co-occurrences are used to identify patterns among data points (words). The lists of words for both word order and co-occurrences are given in Appendices G and H.
Among the five datasets, the ACE2020 dataset is in XLS (structured) format and contains labeled news text produced and recorded at different news agencies. The dataset comprises 621 news items of different categories. The Aquaint dataset is in TXT (unstructured) format and contains 50 news items that are diverse in nature and rich in information. This dataset comprises 729 lines. The Sarcasm headline dataset is in JSON (semi-structured) format and also contains news headlines. This dataset comprises 26709 lines. All datasets were purchased by the LDC organization for research purposes.
In step one, tokens are generated from the heterogeneous datasets. The input datasets contain news from daily life, including sarcasm (keys and values). The participating datasets are in structured (XLS), semi-structured (JSON) and unstructured (TXT) formats. After preprocessing the input datasets, a structured dataset is formed, which is used for token generation.
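A minimal sketch of this first step is given below, using only the Python standard library. It assumes CSV text as a stand-in for the XLS input, one JSON object per line for the semi-structured input, and a simple regular-expression tokenizer; the column name `text`, the key `headline`, and all function names are hypothetical.

```python
import csv
import io
import json
import re

TOKEN_PATTERN = re.compile(r"[A-Za-z']+")


def tokenize(text):
    """Lowercase a string and split it into alphabetic tokens."""
    return [t.lower() for t in TOKEN_PATTERN.findall(text)]


def rows_from_structured(csv_text, text_column):
    """Read the text column of tabular data (CSV used here in place of XLS)."""
    return [row[text_column] for row in csv.DictReader(io.StringIO(csv_text))]


def rows_from_semi_structured(json_lines, text_key):
    """Read one text field from each JSON object, one object per line."""
    return [json.loads(line)[text_key] for line in json_lines.splitlines() if line.strip()]


def rows_from_unstructured(raw_text):
    """Treat every non-empty line of plain text as one record."""
    return [line for line in raw_text.splitlines() if line.strip()]


# Tiny made-up inputs in the three formats.
structured = "id,text\n1,Markets rally today\n"
semi_structured = '{"headline": "Local man wins lottery", "is_sarcastic": 0}\n'
unstructured = "Rain expected tomorrow.\n"

all_rows = (
    rows_from_structured(structured, "text")
    + rows_from_semi_structured(semi_structured, "headline")
    + rows_from_unstructured(unstructured)
)
tokens = [tok for row in all_rows for tok in tokenize(row)]
print(tokens)
```

Once all three sources are reduced to a flat list of text rows, the same tokenizer can be applied regardless of the original format.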
In the second step, the root words are identified from the generated tokens. In the third and fourth steps, the dimensions are determined and aggregated into categories of root words. As stated above, the indexing scheme of the dimensions follows the concept of one-to-one and one-to-many cardinalities from SQL.
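Steps two to four can be sketched as follows. The grouping of dimensions into categories follows the description in the text (synonym and antonym under context, formal and lexical under semantic, word order and co-occurrence under syntactic). The suffix-stripping `root_word` function is a deliberately naive stand-in for whatever root-word identification the authors used, and the nested dictionary is only one possible way to represent a one-to-many index.

```python
from collections import defaultdict

# Dimension -> category, as described in the text.
CATEGORY_OF_DIMENSION = {
    "synonym": "context",
    "antonym": "context",
    "formal": "semantic",
    "lexical": "semantic",
    "word_order": "syntactic",
    "co_occurrence": "syntactic",
}


def root_word(token):
    """Naive root-word guess: strip one common suffix (illustrative only)."""
    token = token.lower()
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def aggregate(entries):
    """Build root word -> category -> dimension -> [terms] (one-to-many)."""
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for token, dimension, term in entries:
        category = CATEGORY_OF_DIMENSION[dimension]
        index[root_word(token)][category][dimension].append(term)
    return index


# Made-up entries for illustration.
entries = [
    ("banks", "synonym", "shore"),
    ("bank", "synonym", "boundary"),
    ("bank", "lexical", "financial institution"),
]
index = aggregate(entries)
print(sorted(index["bank"].keys()))
print(index["bank"]["context"]["synonym"])
```

One root word maps to several categories, each category to several dimensions, and each dimension to several terms, which mirrors the one-to-many cardinality mentioned above.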
For the implementation of MRM, the research followed the MRM development stages (see Fig. 3). The experiment aimed to assess the performance of MRM. A single workstation was used for the experiments, with the following specifications: GPU: NVIDIA Tesla P100 12GB Passive GPU, CPU: Intel Xeon E5-2620 v4 2.1GHz, 32 cores, 128GB RAM, 800GB SSD, 1GB bandwidth Ethernet card, and Windows operating system. Textual datasets with various characteristics and sizes of 75KB, 150KB, and 10MB are employed. For performance evaluation, Anaconda and Python 3.7.3 are installed on the workstation along with Jupyter Notebook, pandas, NumPy, matplotlib, and Orange3 libraries.
Based on the root words and dimensions of MRM, the most common words in the linguistic word categories are retrieved and named mrm_words. Each word carries an mrm_score(), which helps in data harmonization (DH).
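As a rough illustration, the selection of common words and the score lookup might look like the sketch below. Only the names mrm_words and mrm_score() come from the text; the signature of `mrm_score`, the helper `build_mrm_words`, and the score table are assumptions, since the paper states only that scores are assigned empirically.

```python
from collections import Counter


def build_mrm_words(root_words, top_n):
    """Keep the top_n most frequent root words."""
    return [word for word, _ in Counter(root_words).most_common(top_n)]


# Hypothetical, empirically assigned scores (values are made up).
EMPIRICAL_SCORES = {"bank": 0.9, "river": 0.7}


def mrm_score(word, scores=EMPIRICAL_SCORES, default=0.0):
    """Look up the score of a word; unknown words get the default."""
    return scores.get(word, default)


roots = ["bank", "river", "bank", "tree", "river", "bank"]
mrm_words = build_mrm_words(roots, top_n=2)
print(mrm_words)
print([mrm_score(w) for w in mrm_words])
```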
Validation of the results (MRM against SRM) is discussed in the following section.

IV. RESULTS
The proposed Multidimensional Reference Model (MRM) was developed using linguistic word categories, i.e., context, semantic and syntactic clues. The main aim of developing MRM is to improve the quality of term matching by referring to the target terms in different dimensions. This is achieved by indexing domain knowledge against root words/tokens. The index is generated and classified using synonyms, antonyms, lexical semantics, formal semantics, word order, and co-occurrence.
Each word has its own empirically assigned score (mrm_score()), which helps in matching terms based on defined rules, semantic matching, and lexical matching. The total number of words generated from the linguistic word categories (i.e., context, semantic and syntactic clues) for the MRM repository is 37321.
The performance of the proposed MRM is compared with that of the existing single dimensional reference model (SRM) in this section. Five heterogeneous datasets, namely ACE 2020, Aquaint, Sarcasm, HUA, and UoA, are run on both SRM and MRM in order to reach a justifiable conclusion about which reference model performs better. It is important to mention that SRM was previously evaluated in a similar comparison on only two of the five datasets (i.e., HUA and UoA). Our comparison is therefore more comprehensive, as it covers heterogeneous data structures.
The experiment was conducted five times (Batches 1-5) for each dataset. Batch 1 uses 20% of each dataset, and each subsequent batch adds a further 20% until 100% of each dataset is tested. This is done for both MRM and SRM to evaluate their individual performance. The batches and their respective data distributions are explained in [cross-reference missing in source]. Table I (Appendix B) shows the results of the experiments conducted on MRM, presenting the total terms of the input datasets, the total terms matched with MRM, and the percentage of matched terms.
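The batching scheme can be sketched as follows, assuming that each batch is a cumulative prefix of the dataset's term list (the text does not say how the terms in each batch are chosen). The function name `cumulative_batches` is illustrative. With floor division, the batch sizes for 12820 and 82 terms agree with the batch 1 and batch 5 totals reported for ACE2020 and UoA.

```python
def cumulative_batches(terms, n_batches=5):
    """Split terms into growing prefixes: 20%, 40%, ..., 100% for n_batches=5."""
    batches = []
    for k in range(1, n_batches + 1):
        end = len(terms) * k // n_batches  # floor of k/n_batches of the data
        batches.append(terms[:end])
    return batches


ace_sizes = [len(b) for b in cumulative_batches(list(range(12820)))]
uoa_sizes = [len(b) for b in cumulative_batches(list(range(82)))]
print(ace_sizes)
print(uoa_sizes)
```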
In order to evaluate the performance of the reference models on the participating datasets, the experiments are conducted on five batches of each dataset, as presented in Appendix B. The two reference models are tested five times for each variable. The scores of the five rounds are then averaged and compared, and the results of each round are presented separately. Fig. 4 illustrates the results of round one, comparing MRM and SRM in terms of total terms and matched terms for all participating datasets. The results of the existing SRM are shown on the left of the figure and those of the proposed MRM on the right.
The analysis begins with the performance of SRM on the participating datasets. First, 2564 terms of ACE2020 were tested on SRM, of which 1212 were matched. Second, 2192 terms of the Aquaint dataset were examined, of which only 551 were matched. Similarly, 5740 terms of the Sarcasm dataset were tested, of which merely 1198 were matched. Subsequently, 16 terms of the UoA dataset were examined on SRM, of which 15 were matched. Lastly, 12 terms of HUA were tested, of which 11 were matched. The results show variation in term matching with SRM, but it performed well on the UoA and HUA datasets.
The same input datasets are then used to test the performance of MRM. First, 2564 terms of ACE2020 were tested on MRM, of which 2080 were matched. Subsequently, 2192 terms of the Aquaint dataset were examined, of which 1678 were matched. Similarly, 5740 terms of the Sarcasm dataset were tested, of which 4568 were matched. Afterwards, 16 terms of the UoA dataset were examined on MRM, of which 11 were matched. Finally, 12 terms of HUA were tested, of which seven were matched. The findings from this round suggest that MRM performed better than SRM on the ACE 2020, Aquaint and Sarcasm datasets, whereas SRM works better on the HUA and UoA datasets. Fig. 5 illustrates the terms matched (as a percentage) with both reference models on the participating datasets.
The percentages of matched terms using SRM for ACE2020, Aquaint, Sarcasm, UoA and HUA are 47%, 25%, 20%, 93% and 91%, respectively, whereas the percentages using MRM are 81%, 76%, 79%, 68% and 58%, respectively. MRM's higher percentages on the first three datasets arise because the proposed MRM covers multiple dimensions such as context, semantic and syntactic clues. One of the significant contributions of MRM is that it checks each participating word/token from the input dataset against index-based domain knowledge.
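The percentages above follow from the batch 1 counts as matched terms divided by total terms. The short sketch below recomputes them from the counts reported in the text; the function name `matched_percentage` is illustrative. The computed values can differ from the reported whole-number percentages by a fraction of a point, depending on whether the figures are rounded or truncated.

```python
def matched_percentage(matched, total):
    """Percentage of input terms that found a match in the reference model."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * matched / total


# Batch 1 counts as reported in the text:
# dataset -> (total terms, matched with SRM, matched with MRM)
batch1 = {
    "ACE2020": (2564, 1212, 2080),
    "Aquaint": (2192, 551, 1678),
    "Sarcasm": (5740, 1198, 4568),
    "UoA": (16, 15, 11),
    "HUA": (12, 11, 7),
}

for name, (total, srm, mrm) in batch1.items():
    srm_pct = matched_percentage(srm, total)
    mrm_pct = matched_percentage(mrm, total)
    print(f"{name}: SRM {srm_pct:.1f}%, MRM {mrm_pct:.1f}%")
```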
The input tokens are thus checked multiple times, and very similar words are produced based on the context and similarity scores of the index. It is worth noting that if a token's score is high in terms of similarity but low in terms of context, then the terms matched on the basis of context are selected. The existing SRM, by contrast, checks similarity based only on string and lexical similarity and only in a single dimension.
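One possible reading of this selection rule is sketched below: candidates are ranked by context score first, and string similarity is used only to break ties. The candidate structure, the scores, and the name `select_term` are assumptions for illustration; the paper does not specify how the two scores are combined.

```python
def select_term(candidates):
    """Pick the candidate with the highest context score; similarity breaks ties."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: (c["context"], c["similarity"]))
    return best["term"]


# Made-up candidates for the token "bank" in a sentence about rivers.
candidates = [
    {"term": "financial institution", "similarity": 0.9, "context": 0.2},
    {"term": "boundary", "similarity": 0.6, "context": 0.8},
]
print(select_term(candidates))
```

Here the term with the lower similarity score is chosen because its context score is higher, which matches the behavior described in the paragraph above.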
Comparative analysis of the results of SRM and MRM shows that MRM performs better than SRM on the ACE 2020, Aquaint and Sarcasm datasets, while SRM performs better on the HUA and UoA datasets. The results of MRM on the UoA and HUA datasets are low because of the different domain knowledge (medical) of these datasets. In Fig. 6, the performance of SRM and MRM is measured for batch 5 on the participating datasets. The remaining batches (2-4) are not presented here, but the average of all five batches is presented in Table I (Appendix B).
Fig. 6 depicts the results of round five for all participating datasets using SRM and MRM. The results of the existing SRM are shown on the left of the figure and those of the proposed MRM on the right. The analysis begins with the performance of SRM on the participating datasets. First, 12820 terms of ACE2020 were tested on SRM, of which 5605 were matched. Second, 10960 terms of the Aquaint dataset were examined, of which only 2405 were matched. Similarly, 28700 terms of the Sarcasm dataset were tested, of which merely 4701 were matched. Subsequently, 82 terms of the UoA dataset were examined on SRM, of which 70 were matched. Lastly, 60 terms of HUA were tested, of which 50 were matched. The results show variation in term matching with SRM, but it performed well on the UoA and HUA datasets.
The same input datasets are then used to test the performance of MRM. First, 12820 terms of ACE2020 were tested on MRM, of which 9125 were matched. Subsequently, 10960 terms of the Aquaint dataset were examined, of which 7865 were matched. Similarly, 28700 terms of the Sarcasm dataset were tested, of which 19998 were matched. Afterwards, 82 terms of the UoA dataset were examined on MRM, of which 50 were matched. Finally, 60 terms of HUA were tested, of which 31 were matched. For validation of performance, the SRM and MRM results are presented here. The comparison of performance (as a percentage) is depicted in Fig. 6. The findings from this round suggest that MRM performed better than SRM on the ACE 2020, Aquaint and Sarcasm datasets, whereas SRM works better on the HUA and UoA datasets. Fig. 7 illustrates the terms matched (as a percentage) with both reference models on the participating datasets. The percentages of matched terms using SRM for ACE2020, Aquaint, Sarcasm, UoA and HUA are 44%, 22%, 17%, 85% and 83%, respectively, whereas the percentages using MRM are 71%, 72%, 70%, 61% and 52%, respectively. This is because the proposed MRM covers multiple dimensions such as context, semantic and syntactic clues.
One of the significant contributions of MRM is that it checks each participating word/token from the input dataset against the domain knowledge by means of indexing. The input tokens are thus checked multiple times, based on the context and similarity scores of the index. It is worth noting that if a token's score is high in terms of similarity but low in terms of context, then the terms matched on the basis of context are selected. The existing SRM, by contrast, checks similarity based only on string similarity and only in a single dimension.
Comparative analysis of the results of SRM and MRM shows that MRM performs better than SRM on the ACE 2020, Aquaint and Sarcasm datasets, while SRM performs better on the HUA and UoA datasets. The results of MRM on the UoA and HUA datasets are low because of the different domain knowledge (medical) of these datasets.

V. DISCUSSION
The multidimensional reference model has been developed to represent domain knowledge. MRM helped in expanding the domain knowledge using linguistic word categories such as lexical, semantic, and syntactic. In this research, MRM was tested in multiple rounds (batches 1-5) on heterogeneous datasets from diverse domains. It enhanced the coverage of words and helped in term harmonization. The MRM contains 37321 words as a rich template in the form of domain knowledge / a data dictionary. Lists of words from lexical, semantic, and syntactic clues, each with an mrm_score(), have been formed.
MRM is evaluated on different datasets using the similarity score (as a percentage). A high similarity score means that more root words are matched with input words. The experiments indicate that the proposed work is more scalable and includes more similar words on the basis of mrm_score(). It has also been observed that a larger share of terms was matched for the ACE2020, Aquaint and Sarcasm datasets than for the UoA and HUA datasets. This is because ACE2020, Aquaint and Sarcasm cover daily-life topics, whereas UoA and HUA contain medical-domain data.

VI. CONCLUSION
During the literature review, while looking for solutions to data heterogeneity, it was found that the only possible solution is to harmonize data by adopting techniques such as semantic matching, lexical matching and reference matching templates. On that basis, the reference model developed in [14] for a data curation framework for medical cohorts was taken as the baseline study. That reference model (SRM) contains the domain knowledge of specific terms used in the medical domain. The performance of MRM has been evaluated on five heterogeneous (structured, semi-structured and unstructured) datasets in five rounds. The results of each round on ACE2020, Aquaint, Sarcasm, UoA and HUA show better overall performance of MRM over its counterpart reference model, i.e., the single dimensional reference model. The overall performance of MRM exceeds that of SRM by more than 30% on the ACE2020, Aquaint, and Sarcasm datasets, whereas SRM performed better on the UoA and HUA datasets. To

VII. FUTURE RECOMMENDATIONS
Other categories of linguistics and computational linguistics can be used for further improvement in the field of English grammar.
The categories used in MRM are context clues, semantic clues and syntactic clues. Context clues are further classified into synonyms and antonyms. Sample words and their scores are presented in Appendices C and D. The second and most important category used in MRM for indexing linguistic words is semantics. It plays a vital role in understanding the information related to the datasets.
It is important to highlight that the categories of MRM, namely contextual, semantic, and syntactic clues and their scores (as shown in Appendices C-H), helped in developing the multidimensional (index-based) reference model. The MRM provides the input to the section that performs the data harmonization process. That section contains terminology extraction, rules definition, lexical matching and semantic matching, which are responsible for producing the data harmonization report and the harmonized dataset.

III. METHODOLOGY
For the development of the multidimensional reference model, the following four steps have been taken: (i) defining the generated tokens, (ii) identifying the root word, (iii) determining the dimensions, and (iv) aggregating the dimensions of the root word. These steps are also shown in Fig. 3. …

Fig. 4. Number of successfully matched terms for MRM and SRM on batch 1.

Fig. 5. Percentage of matched terms for MRM and SRM on batch 1.

Fig. 7. Percentage of matched terms for MRM and SRM on batch 5.
conclude with the performance of MRM, it has been observed that the use of MRM supports the DHF in the selection of key terms based on semantically and lexically matched terms. The contribution is the design and development of a Multidimensional Reference Model based on linguistic categories such as context, semantic and syntactic clues. The model enables the use of indexing for any English sentence by introducing words and their respective scores. The proposed MRM produced a large number of words that can be used as a reference for any general domain that contains day-to-day data generated in textual formats.