Identity Attributes Metric Modelling based on Mathematical Distance Metrics Models

The internet has brought many security challenges to the interactions, activities, and transactions that occur online. These include invasion of the privacy of individuals, organizations, and other online actors. Real-life relationships are affected by mischievous online actors who intend to misrepresent or ruin the character of innocent people, leading to damaged relationships. The proliferation of cybercrime has threatened the value and benefits of the internet. Identity theft by fraudsters who intend to steal assets in real space or online has escalated. This study has developed a metrics model based on distance metrics in order to quantify the credential identity attributes used in online services and activities. This is to help address digital identity challenges and bring confidence to online activities and ownership of assets. The application forms and identity tokens used in various sectors to identify online users were the sources of the identity attributes in this paper. Corpus toolkits were used to mine and extract the identity attributes from the various forms of identity tokens. Term weighting schemes were used to compute the term weight of the identity attributes. Other methods used included Shannon entropy and the Term Frequency-Inverse Document Frequency scheme (TF*IDF). The data was standardized using a data normalization method. The results show that using the Cosine Similarity Measure, we can identify the identity attributes in any given identity token used to identify individuals and entities. This will help to attach legitimate ownership to the digital identity attributes. The developed model can be used to uniquely identify an online identity claimant and help address the security challenge in identity management systems. The proposed model can also identify the key identity attributes that could be used to identify an entity in real or cyber spaces.
Keywords—Mathematical modeling; Cosine Similarity Measure; text frequency; inverse document frequency; cyber space; term weight; internet; digital identity; trust model; normalization; text mining


I. INTRODUCTION
Identifying the internet users associated with valuables held online has become a serious concern. The challenges that information security faces in establishing the real ownership of identities used on the internet and in online services and activities are of great concern. This research has developed a metrics model based on distance metrics in order to quantify the credential identity attributes used in online services and activities. The model will help to improve cyber security in digital identity management.
This study reviewed literature relevant to the work to establish what past efforts in this area have covered. Areas explored include the effects of the internet on society and studies from various disciplines that help to explain what identity is. Various forms of identity that make up partial identities have been considered, since these have an impact on the identity of a person or entity. What identity implies for online services and activities has been examined to give this study a relevant context. Digital identity depends on trust; it is therefore imperative to reflect on a trust framework to bring to the fore how digital identity and trust are interrelated. A large part of online activity involves the communication of information; we therefore reflected on a communication trust model applicable to our study and on the value it would add. We reflected on Shannon's communication trust framework, from Shannon's information theory, to guide us in considering digital identity with respect to trust in online activities. Since our work is premised on mathematical modeling, it was imperative to consider mathematical modeling and how it could influence our work. This research includes text mining from different documents whose sources are varied, so the mined data would include errors arising from the different backgrounds of the documents. To remove errors, which at times are due to measurement units, noise, and estimations, the data must be standardized before it is used in our metrics.
Mathematical modeling has been used in science to find solutions to real-life problems, and this study takes interest in it. Using a mathematical model, a solution is proposed to the challenges encountered in cyberspace concerning digital identity security. The study will use the proposed model on mined data to quantify the identity attributes. We will use the model to verify the identification of the owner of a digital identity; we will further test the model and establish which identity attributes are key in a given corpus for identifying an identity claimant.
The literature reviewed showed that the vector space model uses a storage matrix where columns represent the documents in a collection and rows represent the terms in a document. The term frequencies of a given document would help us establish the important identity attributes that would identify an entity. The literature indicated that there is a variety of schemes for term weights (attribute importance) that would help us establish which terms are important in a given token of identity. Some of the schemes (or information retrieval methods) include Shannon's entropy and Term Frequency-Inverse Document Frequency (TF-IDF). Our interest is to develop a digital identity model that would supply trusted digital identities. We also reviewed literature on multifactor authentication systems based on identity attributes metrics models, to consider past efforts in this direction. Finally, literature on the international standards regarding identity attributes and identity tokens, which are the subject of our study, was considered to establish which standards affect them and to appreciate their value to our work.
Identity attributes were mined from identity documents and application forms for identity enrolment. Such documents were obtained from the internet in PDF format and processed using TalkHelper PDF Converter. Text was then mined from these documents using AntConc 3.5.8, a corpus analysis toolkit for data mining. To remove errors, the data was standardized through normalization.
The proposed model was used for identity attribute quantification and verification. It was also used to determine term importance in the corpus. Distance metrics have been the basis of our model for quantifying the identity attributes. The model identifies the attributes that are key identifiers of an entity; in other words, the attributes that can closely identify an entity in online activities. The results of our study are given and the conclusion of the study is drawn.

A. Adverse Effects of Internet
The rapid development of information and communication networks by governments, colleges, enterprises, and individuals means that more and more information systems are being employed without clear distinction of the persons and devices behind their use [1]. The need for identity that provides complete privacy is therefore vital [1]. It has been established that cybercrime has become one of the fastest growing crimes in the world [2]. Studies have shown that computer networks are subject to attacks from malicious sources, and that with the advent and increasing use of the internet such attacks are increasingly common [3]. In 2007 it was reported that "in Australia alone the proceeds of identity theft, [was] still one of the largest sources of fraud, [and was] estimated to be nearly $6 billion a year [4]". Identity theft is one of the fastest growing crimes in the world. Security includes protecting individuals, organizations, devices, and infrastructure from identity theft, unauthorized data sharing, and human rights violations [5]. When devices are lost or stolen, all of the data stored on or accessible from the mobile device may be compromised if access to the device or the data is not effectively controlled [6].

B. Partial Identity
To appreciate identity, we need to consider that a whole identity is formed from partial identities. A person may have different identities according to the context in which the identity is applied. For instance, a researcher may be a father, magazine columnist, human rights activist, sportsman, politician, philanthropist, friend, and lecturer. He is identified differently in each context, and the attributes that identify him accurately may differ from one context to another. Fig. 1 illustrates partial identities. A comprehensive identity could be assembled by identifying the key characteristics of an individual, which we would treat as identity attributes.
Identity encompasses all the essential characteristics that make each human unique [3]. This individual's identity as a father may have characteristics such as: father of three, kind, loving, hardworking, protective, supportive, merciful, jovial, progressive, etc. The identity of a person comprises a large number of personal properties [3], as indicated above. These properties help to uniquely identify an individual.

C. Digital Identity
It is indicated in [7] that "a digital identity is a virtual representation of a real identity that can be used in electronic interactions with other machines or people". "An identity consists of traits, attributes, and preferences upon which one may receive personalized services". E-services require an effective way to manage digital identity information of the users [7].
Windley defines a digital identity as the "data that uniquely describes a subject or an entity and the ones about the subject's relationships to other entities [8]". Further, Windley states that a digital identity is "the persona that an individual presents across all the digital spaces [8]". In [9], digital identity is defined as the "electronic representation of personal information of an individual or organization (name, address, phone numbers, demographics, etc.)".
It is noted that "in the digital world a person's identity is typically referred to as their digital identity [9]". It is argued in [10] that "identity encompasses all the essential characteristics that make each human unique". Satchell et al. indicated that "identifiers of a respective individual or entity would identify the entity online, from any context of the identity. An identifier uniquely identifies an entity (a person, a computer, an organisation, etc.) within a specific scope [11]". This underscores that digital identification is key to the online activities of an entity on the internet or a computer network.

E. Mathematical Modeling
Haines and Crouch (2007) characterize "mathematical modeling as a cyclical process in which real-life problems are translated into mathematical language, solved within a symbolic system, and the solutions tested back within the real-life system [21]". This demonstrates how mathematical modelling can yield a model that helps to solve a real-life situation using mathematics. This research aims to establish such a model as a solution to its research problem. "Mathematical models comprise a range of representations, operations, and relations, rather than just one, to help make sense of real-life situations [22]".

F. Data Standardization
"We often want to compare scores or sets of scores obtained on different scales [23]". Standardizing data that comes from different sources would help us to "eliminate the unit of measurement by transforming the data into new scores with a mean of 0 and a standard deviation of 1". Since this research aims to compare its results with the performance of other metrics, it is prudent to have a common ground for comparing them. We transform data "to improve our ability to discover knowledge [24]"; this transformation "includes normalising data [24]." Olson and Delen in [25] indicate that "the main advantage is to avoid attributes in greater numeric ranges dominate those in smaller numeric ranges. Another advantage is to avoid numerical difficulties during the calculation." It was noted that "normalization may improve the accuracy and efficiency of mining algorithms involving distance measurements [26]". It is observed that "a direct application of geometric measures (distances) to attributes with large ranges will implicitly assign bigger contributions to the metrics than the application to attributes with small ranges. The attributes should be dimensionless because the numerical values of the ranges of dimensional attributes depend on the units of measurements and, therefore, the choice of the units of measurements may greatly affect the results of clustering. One should not use distance measures without normalization of data [27]".

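The standardization described in this section (new scores with a mean of 0 and a standard deviation of 1) can be illustrated with a minimal Python sketch. The function name `standardize`, the sample values, and the use of the population standard deviation are assumptions made for illustration; they are not taken from the study's data.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


# Hypothetical raw term frequencies measured on an arbitrary scale.
raw_counts = [2, 4, 4, 4, 5, 5, 7, 9]
z_scores = standardize(raw_counts)
print(z_scores)
```

After this transformation, attributes measured on different scales contribute comparably to a distance computation, which is the motivation given in [26] and [27].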
III. RELATED WORKS
Campbell et al. state that "in the simplest case, the components of [the sparse] vectors are the raw frequency counts of each term in each document [28]". They also observed that "search engines of the World Wide Web (www) are based on certain information retrieval models like Boolean model, Probabilistic model, and Vector space model [28]". Our interest is in the vector space model; Campbell et al. indicate that "the main purpose of [information retrieval models] is to retrieve relevant documents specific to a search [28]". It was observed that "[vector space model] uses a storage matrix where columns represent the documents in a collection and whose rows represent the term frequencies among the documents [28]". They also stated that "For ad-hoc querying, dynamic queries are compared against a static document database in order to find documents closest to the query [28]". Simplistically speaking, a search engine has "static database of documents, a query processor, to convert incoming (dynamic) queries into a format compatible with the representation model, and a relevant measure to compare converted queries against documents [28]". The researchers indicate that "when conducting a query, one method is to search through the storage matrix and match the query terms with row terms producing the document closest to the query [28]".
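The storage matrix of the vector space model described above, with one row per term and one column per document, can be sketched as follows. This is an illustrative sketch only: the function name `term_document_matrix`, the whitespace tokenization, and the two sample documents are assumptions, not part of [28].

```python
from collections import Counter


def term_document_matrix(documents, terms):
    """Rows are terms, columns are documents, cells are raw term frequencies."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    return [[doc_counts[term] for doc_counts in counts] for term in terms]


# Two hypothetical documents and three single-word terms.
docs = [
    "surname forename date of birth surname",
    "forename address address nationality",
]
terms = ["surname", "forename", "address"]
matrix = term_document_matrix(docs, terms)
for term, row in zip(terms, matrix):
    print(term, row)
```

Each column of the matrix is then the term vector of one document, which is the representation that a similarity measure compares.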
Researchers have established that "Shannon's entropy method is one of the various methods for finding weights [29]". It has been observed that "multiple attribute decision making (MADM) refers to making preference decisions (e.g., evaluation, prioritization, and selection) over the available alternatives that are characterized by multiple, usually conflicting, attributes [29]". It was observed that "since each criterion has a different meaning, it cannot be assumed that they all have equal weights, and as a result, finding the appropriate weight for each criterion [29]". They discovered that "in MADM the greater the value of the entropy corresponding to a special attribute, which imply the smaller attribute's weight, the less the discriminate power of that attribute in decision making process [29]".
It is indicated in [29] that "the raw data are normalized to eliminate anomalies with different measurement units and scales. This process transforms different scales and units among various criteria into common measurable units to allow for comparisons of different criteria". It was shown in [30] that "the entropic-weight method, from Shannon's entropy theory, was applied for the purpose of obtaining a classification". Vajapeyam summarizes "Shannon's entropy [as] a direct measure of the number of bits needed to store the information in a variable, as opposed to its raw data [31]". He adds that "entropy is a direct measure of the 'amount of information' in a variable [31]".
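The entropy-based weighting discussed above can be sketched in Python. This follows the commonly used form of the entropy weight method (column normalization, entropy scaled by 1/ln m, weights proportional to 1 minus entropy); whether [29] or [30] use exactly this form is an assumption, and the function name and sample scores are hypothetical.

```python
import math


def entropy_weights(matrix):
    """matrix[i][j] is the non-negative score of alternative i on attribute j.

    Assumes at least two alternatives, a positive total in every column,
    and at least one column that is not perfectly uniform.
    """
    m = len(matrix)
    n = len(matrix[0])
    k = 1.0 / math.log(m)  # scales entropy into the range [0, 1]
    divergences = []
    for j in range(n):
        column = [row[j] for row in matrix]
        total = sum(column)
        probs = [x / total for x in column]
        entropy = -k * sum(p * math.log(p) for p in probs if p > 0)
        divergences.append(1.0 - entropy)  # higher entropy, lower divergence
    total_divergence = sum(divergences)
    return [d / total_divergence for d in divergences]


# Hypothetical scores: attribute 0 is identical for every alternative,
# attribute 1 varies strongly between alternatives.
scores = [[5, 1], [5, 2], [5, 9]]
weights = entropy_weights(scores)
print(weights)
```

The uniform attribute has maximal entropy and receives a weight close to zero, in line with the observation that greater entropy implies a smaller attribute weight.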
Inambao et al. came up with a digital identity model that would "supply trusted digital identities [32]"; the model would "identify and extract various forms of identity attributes from various forms (identity tokens) [32]". The model was established on the Euclidean distance metric, based on Euclidean geometry. This model identified the attributes that were key identifiers of an entity; in other words, the attributes that can closely identify an entity. This model helps in "quantifying, implementing, and validating of the attributes from application forms (or identity tokens) [32]". Chinyemba and Phiri [33] showed "how to secure biometric data whilst at rest and or in motion so as to deter attackers in public organizations". Biometric identification contributes immensely to a person's identification and can therefore contribute to the collection of digital identity attributes for individual identification. Ibou et al. indicated that "attribute-based digital identity modelling [needed] to take into account privacy issues [34]" and "proposed [a] model [that] takes into consideration three fundamental aspects, namely security, privacy and identity theft [34]".
The work of Phiri et al. introduced a "multifactor authentication system based on two identity attributes metrics models [35]". This broadens the scope of digital identification in an identity management system; different modes of identification could be combined to make digital identification robust and effective. Strengthening the security of digital identity would include developing multi-modal authentication, that is, a combination of different authentication methods. For instance, "when using an ATM bank card, in addition to the PIN number the user may be requested to submit a biometric feature such as a fingerprint in order to withdraw a certain amount of money above a given limit. A combination of biometrics, token based credentials and pseudo metrics will most likely form a very effective defense against imposters [35]". The researchers were hoping that "an additional fourth category of inputs would take into account identity attributes such as the name, date of birth, address and other acquired identity attributes for consideration [35]". Our research efforts build on this past research work.
The work of Phiri et al. introduced a "multifactor authentication system based on two identity attributes metrics models [36]". They argued that this would "reduce the cases of cybercrime since it becomes difficult to forge all the proposed four authentication factors that include biometrics [36]". They went on to demonstrate "the performance of the three fuser block technologies namely Artificial Neural Networks (ANN), Fuzzy Inference System (FIS) and Adaptive Neuro-Fuzzy Inference System (ANFIS) using the term weight and entropy identity attributes metrics".
The current research grew out of this work of Phiri et al., who indicated in closing that "future works [would] look at other combinations of the authentication factors and metrics modelling methodologies [36]".

IV. RESEARCH METHOD
According to Creswell [37], this study is quantitative in nature; a survey of the perceptions of observers was therefore planned, using a questionnaire that would address these perceptions. As this study is quantitative in nature, extensive literature on quantitative studies was reviewed. Previous works that applied quantitative techniques in areas with a bearing on this research were reviewed. Quantitative data was analyzed with the help of spreadsheets (e.g. Microsoft Excel). The techniques used include data mining techniques and statistical analysis. PDF application forms for identity tokens, which request the identity attributes of individuals within the research sample, were extracted from the internet. The identity attributes were drawn from a list of internationally identified identity attributes published by the International Organization for Standardization (ISO).
Documents in PDF format from the corpus of Government of the Republic of Zambia (the researcher's residence) documents were searched for and harvested from the internet. To test the proposed model, we took a set of application forms for identity tokens at random from our selected area. We picked ten (10) of the 32 PDF documents that were extracted from the internet. Our model is focused on identifying the set of attributes that would identify a claimant of a digital identity who sufficiently matches the entity to be identified. Matching could be done for one claimant or for multiple claimants. In simple terms, if one of our documents represents the token that owns the digital identity being claimed, we can compare the digital identity attributes of this entity with those of the claimant. For the purposes of this research, ten documents suffice: one is the object and the other nine are the claimants of the digital identity. All ten documents were tested on the metrics in the proposed mathematical model.
As indicated, ten (10) documents were picked from the Government of Zambia sets of documents. These documents are listed in Table I.

A. Identity Attribute Text Mining
Literature from the International Organization for Standardization (ISO) was consulted to identify attributes that are recognized as standard in enrolment for diverse online services. The attributes identified by the international standards of ISO/IEC JTC 1/SC 27 were therefore considered and used in this research. These standards list attributes that could be collected from individuals at the time of enrolment for digital services; "Validation can occur during Identity Proofing, Identity Information Verification and Verification" [38] regarding entities from identity tokens. The list of attributes from ISO/IEC JTC 1/SC 27 indicates elements that would form identifiers of an individual; these are shown in Table II.
Tokens of identity are also identified by this ISO standard. The identity documents and service enrolment application forms fall in the category covered by the international standard ISO/IEC 29003:2013 [38]. These documents, according to the research samples, were searched for on the internet and obtained in PDF format. TalkHelper PDF Converter version 2.2.9.0 was used to convert documents that were in formats other than PDF into PDF; documents already in PDF needed no format conversion. Fig. 5 shows the TalkHelper PDF Converter used in this study. Documents in PDF format were then converted into text files (using TalkHelper PDF Converter version 2.2.9.0) in readiness for text mining. AntConc 3.5.8, a corpus analysis toolkit, was used for text mining. This tool was used to get the text frequency of the corpus files from different industries and regions, as discussed above, that were imported into the tool from their respective folders. Fig. 4 shows the tool that was used for text mining.
Each identity attribute had its term frequency recorded from the corpus analysis toolkit. Text mining was done on these documents using the techniques discussed above, based on the nineteen (19) attributes that we have been using. Table III shows the term frequencies (Tf) of each of the respective attributes after text mining. "As the time passes, a lot of information and new challenges related to information acquisition and data mining are emerging very rapidly [39]". Efforts to curb online risks ought to match the rapid growth of technology and online services.
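The term-frequency counting step described above can be approximated in Python for readers without access to a corpus toolkit. This is only a sketch of the idea and not a reproduction of AntConc's behavior; the function name `attribute_frequencies`, the sample form text, and the attribute list are hypothetical.

```python
import re


def attribute_frequencies(text, attributes):
    """Count case-insensitive, whole-phrase occurrences of each attribute."""
    lowered = text.lower()
    counts = {}
    for attribute in attributes:
        # Word boundaries keep "surname" from matching inside longer words.
        pattern = r"\b" + re.escape(attribute.lower()) + r"\b"
        counts[attribute] = len(re.findall(pattern, lowered))
    return counts


# Hypothetical text extracted from an application form.
sample_text = (
    "Surname: ____ Date of birth: ____ "
    "Place of birth: ____ Surname at birth: ____"
)
attributes = ["surname", "date of birth", "nationality"]
counts = attribute_frequencies(sample_text, attributes)
print(counts)
```

Multi-word attributes such as "date of birth" are matched as whole phrases, so an attribute that never appears in a document is recorded with a frequency of zero.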

V. PROPOSED MODEL
The Identity Attribute Metric Model proposed in this research is based on distance metrics, specifically the Cosine Similarity measure.

A. Model Quantification
A cluster is a collection of data objects that are similar to objects within the same cluster and dissimilar to those in other clusters. Similarity between two objects is calculated using a distance measure [40]. Charulatha et al. indicate that "Clustering is the grouping of similar instances/objects some sort of measure that can determine whether two objects are similar [26]". As pointed out by Backer and Jain, "in cluster analysis a group of objects is split up into a number of more or less homogeneous subgroups on the basis of an often subjectively chosen measure of similarity (i.e., chosen subjectively based on its ability to create 'interesting' clusters) [34]". "From the scientific and mathematical point of view distance is defined as a quantitative degree of how far apart two objects are [41]." Researchers note that "it is natural to ask what kind of standards we should use to determine the closeness, or how to measure the distance (dissimilarity) or similarity between a pair of objects, an object and a cluster, or a pair of clusters [34]". "In order for the distance metrics to make sense, good data transformation or normalization is required. In data normalization methods, the objective is usually to ensure that the computed distance metric or similarity measure will reflect the inherent distance or similarity of the data [42]".
When documents are represented as term vectors, the similarity of two documents corresponds to the correlation between the vectors. This is quantified as the cosine of the angle between vectors, that is, the so-called cosine similarity. Cosine similarity is one of the most popular similarity measures applied to text documents, such as in numerous information retrieval applications and in clustering [42]. An important property of the cosine similarity is its independence of document length. For example, combining two identical copies of a document d1 to get a new pseudo document d2, the cosine similarity between d1 and d2 is 1, which means that these two documents are regarded to be identical. Given another document d3, d1 and d2 will have the same similarity value to d3 [42] as shown in equation (1).
Documents with the same composition but different totals will be treated identically. When the term vectors are normalized to unit length, the representations of d1 and d2 are the same [42].
The cosine similarity measure has a higher positive correlation than the Euclidean distance [43]. The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude. Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0, 1] [43]. Cosine similarity gives a useful measure of how similar two documents are likely to be in terms of their subject matter [43]. Expressed as a distance, that is, as 1 − cos θ, the metric gives a number from the closed interval [0, 1] for vectors in positive space, with 0 denoting that the two vectors overlap and 1 denoting an angle of 90°, which is the highest difference between the vectors [44].
Cosine Similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them [44]. It can be derived using the Euclidean dot product. Given two non-zero vectors, x and y, the dot product of the two vectors would be represented by
$$x \cdot y = \lVert x \rVert \, \lVert y \rVert \cos\theta \tag{2}$$
This will translate to
$$\cos\theta = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} \tag{3}$$
This also agrees with trigonometry and complex numbers; given two vectors, x and y, in a vector space, the cosine of the angle (θ) between these two vectors would be represented by the equation above.
Given two vectors X and Y, the Cosine Similarity, cos(θ), is expressed in terms of a dot product and magnitudes as
$$\text{Similarity}(X, Y) = \cos\theta = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert} = \frac{\sum_{i=1}^{n} X_i Y_i}{\sqrt{\sum_{i=1}^{n} X_i^2} \, \sqrt{\sum_{i=1}^{n} Y_i^2}} \tag{4}$$
Here X could be the set of attributes of an applicant who claims ownership of the identity attributes, while Y could be the verifier's identity attributes. The cosine function in equation (3) can be represented as a similarity distance measure in equation (4), as is also indicated in [45].
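The cosine similarity between a claimant's attribute vector and a verifier's attribute vector can be sketched in Python as the dot product divided by the product of the magnitudes. The function name `cosine_similarity` and the sample vectors are assumptions for illustration. The sketch also checks the length-independence property described above: doubling a document vector leaves its similarity to any other vector unchanged.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


# Hypothetical term-frequency vectors.
d1 = [3, 0, 1, 2]
d2 = [2 * v for v in d1]  # two copies of d1 combined into one pseudo document
d3 = [1, 1, 0, 2]

print(cosine_similarity(d1, d2))
print(cosine_similarity(d1, d3), cosine_similarity(d2, d3))
```

Because only the orientation of the vectors matters, d1 and d2 are treated as the same document, and both have the same similarity to d3.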

B. Identity Verification
The choice of this model was based on two considerations that could be applied in this study:

1) For verification of ownership
a) When we are specifically interested in attending to one applicant to verify a claim of ownership of a particular digital identity with known identity attributes.
b) When multiple applicants claim ownership of a particular digital identity with known identity attributes and we need to verify the claims.
2) The principle of orientation of two similar vectors in a metric space, which is inherent in the cosine similarity distance.
The Cosine Similarity measure is used in data mining as a technique for finding documents that are similar, based on the text that these documents contain. For instance, this metric is used to find those who share the same tags on a blog, persons who viewed the same documents, and customers who bought similar items online.
Verifying the online identity of claimants could help establish who the legitimate owner is from among multiple identity claimants. We could use the metrics and mathematical computations to achieve this. Therefore, this model can be used in the verification process for an applicant or applicants in the Digital Identity Management System.
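Verification among multiple claimants can be sketched as ranking each claimant's attribute vector by its cosine similarity to the vector of the known digital identity. The names `rank_claimants`, `reference`, and `claimants`, and all the values, are hypothetical; the study does not prescribe an acceptance threshold, so none is applied here.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def rank_claimants(reference, claimants):
    """Return (name, similarity) pairs, most similar claimant first."""
    scores = {
        name: cosine_similarity(reference, vector)
        for name, vector in claimants.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical normalized attribute weights of the known digital identity.
reference = [1.0, 0.8, 0.0, 0.5]
claimants = {
    "claimant_a": [0.9, 0.7, 0.1, 0.6],
    "claimant_b": [0.0, 0.2, 1.0, 0.1],
    "claimant_c": [2.0, 1.6, 0.0, 1.0],  # same orientation as the reference
}
ranking = rank_claimants(reference, claimants)
best_name, best_score = ranking[0]
print(ranking)
```

In this sketch claimant_c ranks first even though its magnitudes differ from the reference, which reflects the orientation principle that motivated the choice of the cosine measure.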

C. Testing the Model
For us to identify the hierarchy of importance of attributes in the corpus, we need to consider the term weight of each attribute within the corpus of the ten (10) documents. We have represented the ten (10) documents in our functions as d1, d2, d3, …, d10. The general expression di denotes any one of these ten documents, with i ranging from 1 to 10.
Chen and Chang indicate that "TF and TF-IDF are widely applied to count the weight of a term" [43]. They further add that "TF represents the number of times a term occurs in a document, and TF-IDF is the combining of TF and IDF weights. IDF indicates the general importance of a term in overall documents" [43].
Researchers indicate that the term frequency (Tf) factor is represented by the "logarithm of the term frequency to scale the effect of unfavorably high term frequency [44]". This is expressed as
$$TF = 1 + \log(tf) \tag{5}$$

D. Indeterminate Considerations
It is important to recognize that the function becomes indeterminate when tf is zero (0), since log 0 is undefined (log x tends to −∞ as x approaches 0). We therefore evaluate this part of the function using a logarithmic property: for any n = 1, 2, 3, …, we have
$$\frac{n-1}{n} \le \ln(n) \le n - 1 \tag{6}$$
Therefore,
$$\frac{x-1}{x} \le \log x \le x - 1 \tag{7}$$
It follows that the upper bound of log x is x − 1. Therefore, replacing log tf by this upper bound in the function TF = 1 + log tf, with x = tf, we have
$$TF = 1 + (x - 1) \tag{8}$$
For x = 0, we have
$$TF = 1 + (0 - 1) = 0 \tag{9}$$
The Inverse Document Frequency component (IDF) of the function is expressed when we "multiply original tf factor by an inverse collection frequency factor (N is the total number of documents in a collection, and ni is the number of documents to which a term is assigned) [43]".
It was indicated in [46] that IDF can be calculated by
$$idf = \log\left(\frac{\text{total number of documents}}{\text{number of documents with the term}}\right) \tag{10}$$
This is represented by the expression:
$$idf_i = \log\frac{N}{n_i} \tag{11}$$
This function will be indeterminate when ni = 0. We observe that
$$\log\frac{N}{n_i} = \log N - \log n_i \tag{12}$$
In our corpus, N = 10. We could have situations when ni = 0; at that point our function would become indeterminate. That is,
$$IDF = \log 10 - \log 0 \tag{13}$$
From our established statement in (7), it therefore follows that
$$IDF = \log 10 - (x - 1) \tag{14}$$
When x = 0, we have
$$IDF = \log 10 - (0 - 1) = \log 10 + 1 \tag{15}$$
Table IV represents the term frequencies (TF) of the corpus. The functions in Table IV, Table V, and Table VI for the term frequencies Tfi and idfi have their indeterminate logarithmic functions resolved and therefore present the outcomes of the functions.
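The handling of the zero cases above can be sketched in Python. The sketch assumes base-10 logarithms, since the text does not state the base, and the function names `tf_component` and `idf_component` are hypothetical. For non-zero inputs the usual logarithmic forms are used; for zero inputs the values derived above are substituted (0 for the TF component and log N + 1 for the IDF component).

```python
import math


def tf_component(tf):
    """TF = 1 + log(tf), with the substituted value 0 when tf is 0."""
    if tf == 0:
        return 0.0  # 1 + (0 - 1), from substituting the bound x - 1
    return 1.0 + math.log10(tf)  # base 10 is an assumption


def idf_component(n_docs, n_i):
    """IDF = log(N / n_i), with log N + 1 substituted when n_i is 0."""
    if n_i == 0:
        return math.log10(n_docs) + 1.0  # log N - (0 - 1)
    return math.log10(n_docs / n_i)


N = 10  # corpus size used in the study
print(tf_component(0), tf_component(1), tf_component(10))
print(idf_component(N, 0), idf_component(N, 5), idf_component(N, 10))
```

With N = 10, a term found in every document gets an IDF of 0, while a term found in no document gets log 10 + 1 under the substitution.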
In 1993 Buckley stated that "over the past 25 years, one class of term weights has proven itself to be useful over a wide variety of collections. This is the class of tf*idf (term frequency times inverse document frequency) weights [47]". "TF-IDF is also one of the most popular term-weighting schemes for user modeling and recommender systems [48]".
Considering the TF-IDF term weighting scheme, from our findings above, we would have the weighting computational outcomes as indicated in Table IV. The metric would be represented by:

$$Tf_i \times idf_i = (1 + \log tf_i) \times \log\frac{N}{n_i} \tag{16}$$

The terms of the function have been explained above. We obtain the weighting of the attributes (terms) from function (16), the outcomes of which are indicated in Table IV.

E. Term Importance
Jiao et al. established that "a classic way to assess the importance of a term is the so-called tf-idf (term frequency-inverse document frequency) term weighting scheme [49]". They further indicated that the term importance "is based on two assumption: a) idf assumption: rare terms are more informative than frequent terms, b) tf assumption: multiple occurrences of a term in a query document are more relevant than single occurrence [49]".
After sorting the Tf*idf weights in descending order, we are able to rank the attributes by importance, from the most important to the least.
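The weighting and sorting steps can be illustrated with a short Python sketch. Several things here are assumptions and not taken from the study: the toy corpus and its attribute names are invented for illustration, base-10 logarithms are used, and a term's overall weight is taken as the sum of its per-document Tf*idf weights, since the text does not say how weights are aggregated across documents.

```python
import math
from collections import Counter


def tfidf_ranking(docs):
    """Rank terms by summed tf*idf weight, highest first.

    docs: list of documents, each a list of attribute terms.
    Assumptions: base-10 logs; per-document weights are summed.
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    # Document frequency n_i: number of documents containing each term.
    doc_freq = Counter(term for c in counts for term in c)

    totals = {}
    for term, n_i in doc_freq.items():
        idf = math.log10(n_docs / n_i)
        totals[term] = sum(
            (1 + math.log10(c[term])) * idf for c in counts if c[term] > 0
        )
    # Sort by weight (descending), breaking ties alphabetically.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical three-document corpus of identity attributes.
docs = [
    ["surname", "address", "address", "phone"],
    ["surname", "passport", "address"],
    ["surname", "email"],
]
ranking = tfidf_ranking(docs)
for term, weight in ranking:
    print(f"{term:10s} {weight:.4f}")
```

In this toy corpus, `surname` appears in every document and therefore receives an idf of zero, which places it last. This matches the idf assumption that rare terms are more informative than frequent ones.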

F. Euclidean Distance based Similarity
Past efforts [32] have shown that Euclidean Distance Geometry could "improve the authentication in digital identity management system and particularly improve the security in digital financial services".
The Euclidean distance between two points or terms ($t_1$ and $t_2$), from a corpus, in a two-dimensional space is represented by the function

$$d(t_1, t_2) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$

where $(x_1, y_1)$ and $(x_2, y_2)$ are the coordinates of $t_1$ and $t_2$.

Table V shows the inverse document frequencies (IDF) in our term weighting function. Table VII shows the rating of the terms to which weighting has been applied. This rating indicates which identity attributes are most important in identifying a digital identity claimant against online interests in this corpus. We are interested in seeing which identity attributes are key to identifying an identity claimant.
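Returning to the Euclidean distance of Section F, a minimal Python sketch is given here. The function name `euclidean_distance` is hypothetical, and the generalization from two dimensions to vectors of any equal length is an assumption made for convenience.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance between two points of equal dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Two hypothetical terms placed in a two-dimensional space.
t1 = (0.0, 0.0)
t2 = (3.0, 4.0)
print(euclidean_distance(t1, t2))
```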

C. Model Based on Cosine Similarity Measure
In order to demonstrate the effectiveness of our proposed model, we apply it to the dataset in which the attributes have been weighted. The results of these metrics are recorded in Tables VIII, IX, and X.
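For reference, the standard textbook definition of the Cosine Similarity measure between two attribute weight vectors is given here. The symbols $A$, $B$, and $m$ are notation assumed for this illustration; the form used in the study may be written differently.

```latex
\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
           = \frac{\sum_{k=1}^{m} a_k b_k}
                  {\sqrt{\sum_{k=1}^{m} a_k^{2}} \; \sqrt{\sum_{k=1}^{m} b_k^{2}}}
```

Here $a_k$ and $b_k$ are the weights of the $k$-th attribute in the two documents being compared, and $m$ is the number of distinct attributes. For non-negative weights the value lies between 0 and 1, and a non-zero vector compared with itself scores exactly 1.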

D. Verification of Ownership
For the purposes of verifying ownership of the attributes by an online user, we will assume that the object of ownership is the user of document 2 from our corpus of ten documents. Document 2 was designed to capture the attributes of people applying for a residence permit. Only an individual whose responses match the attributes of that specific individual would be said to be uniquely similar. To assess the key attributes, we consider the attributes involved in identifying the digital identity of our object and compare them with the attributes from the other nine (9) documents. That is, we take the attributes of the second document and compare them with each of the nine other documents in turn. Using our proposed model based on the Cosine Similarity measure, we then observe how similar the attributes of the second document are to those of the other nine.
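The comparison described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the study's implementation: the function names are hypothetical, and the three short weight vectors are invented stand-ins for the ten-document corpus.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def rank_by_similarity(reference, candidates):
    """Sort candidate documents by similarity to the reference, highest first."""
    scores = {
        name: cosine_similarity(reference, vec)
        for name, vec in candidates.items()
    }
    return sorted(scores.items(), key=lambda kv: -kv[1])


# Hypothetical term-weight vectors over three attributes.
candidates = {
    "doc1": [1.0, 0.0, 2.0],
    "doc2": [0.5, 1.5, 0.0],
    "doc3": [0.0, 1.0, 1.0],
}
ranking = rank_by_similarity(candidates["doc2"], candidates)
for name, score in ranking:
    print(f"{name} {score:.4f}")
```

In this toy example the reference document, compared with itself, scores 1 and is ranked first, which is the behavior expected of a legitimate claimant whose attributes match exactly.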
We therefore standardize the data using term weights, repeat the computations as above, and record the results. The results are reflected in Table IX. Our main interest is to identify the text from the documents that would be the best identifier of the online user. The details of the digital object of an applicant for identity verification, which in our case are represented by the identifying attributes, would need to accurately match the attributes of verification. We therefore consider the importance of the attributes in the corpus of ten documents. Table IX shows the documents sorted in order of importance; in this case, the documents represent the applicants that are being subjected to verification by the process of authentication.
From Table IX, we see that it was important to normalize the term frequencies from the documents so as to remove errors from the data. Without normalization, the rating of the documents is affected to the point that the document compared to itself shows a deficit in the content of terms. Removing the errors through normalization, done by term weighting of the data from the corpus of ten documents, gives a rating in which document 2 compared to itself ranks first. This is the natural expectation of the outcome of this process.
We have just established that when one or more online applications for authentication are received from multiple users, the Cosine Similarity measure could help us accurately identify the true owner of the digital identity. This indicates that the Cosine Similarity measure could be a very strong tool in information security for adding another level of authentication. Coupled with other techniques, it could help build a robust information security system for digital identity management. Table X shows the top ten identity attributes from the ten documents to which TF*IDF term weighting was applied. Picking the identity attributes with the highest weights would help us identify the owner of the identity attributes for an online identity claimant. Applying the Identity Attribute Metrics, developed using the Cosine Similarity measure, we obtain the following results. Testing shows that the proposed Cosine Similarity measure, used as an Identity Attribute Metric Model, rates the document compared with itself as the highest, and hence identifies a claimant of the digital identity as the legitimate owner. This model would be able to identify the true owner among one or multiple claimants of the digital identity. This would help improve security in identifying the legitimate owner of a specific digital identity. Only such an owner should be given access to online assets, services, or attention.

G. Results on Metrics Model
It was observed that the identity attributes from the ISO list were based on physical identification of an individual claimant. The study showed that, using the Cosine Similarity measure, the legitimate owner of the digital identity would be uniquely identified by the topmost score in the computations. To achieve this, the mined text of the digital identity needs to be normalized; in this case we used the term weights to normalize the data. The model gives its best results on term-weighted text. It was also observed that a set of digital attributes scores higher than the others when we apply our model. After sorting the results of the model on the weighted, text-mined identity attributes, it was observed that the identity attributes that locate the residence of a claimant were of paramount importance. It was equally observed that the identifying names of the claimant, national identification, economic activity, and contacts of the claimant were ranked high in the results of our computations.

VIII. CONCLUSION
The proposed model was able to identify the legitimate owner of the digital identity attributes and was therefore able to show who the false-identity online claimants were. The model was also able to identify the attributes that were key to identifying the legitimate owner of the claimed identity; in other words, the most important attributes for distinguishing the legitimate owner from the false ones could be identified using this model. The identity attributes can be extracted from identity tokens by mining identity attribute text using data mining tools. The study has developed an identity attribute metrics model using the Cosine Similarity measure and shown that the Cosine Similarity measure can be used to quantify the identity attributes. The model has been tested on data that was mined and standardized using term weights; the outcome showed that the Cosine Similarity model can identify the unique owner of the digital identity attributes. The model also showed that it could identify a legitimate identity claimant among multiple claims. This model could enhance security in online activities by validating the true owner of a digital identity. It could also be used in multimodal tools as part of a robust digital solution to the challenges of online information security.