An Iterative, Self-Assessing Entity Resolution System: First Steps toward a Data Washing Machine

Data curation is the process of acquiring multiple sources of data, assessing and improving data quality, standardizing and integrating the data into a usable information product, and eventually disposing of the data. This research describes the building of a proof-of-concept for an unsupervised data curation process addressing a basic form of data cleansing: identifying redundant records through entity resolution and correcting spelling errors. The novelty of the approach is to perform ER as the first step, using an unsupervised blocking and stop word scheme based on token frequency. A scoring matrix is used to link unstandardized references, and linking results are evaluated by an unsupervised process based on cluster entropy. The ER process is iterative, and in each iteration, the match threshold is increased. The prototype was tested on 18 fully-annotated test samples of primarily synthetic person data varied in two ways: good data quality versus poor data quality, and a single record layout versus two different record layouts. In samples with good data quality, using both single and mixed layouts, the final clusters had an average F-measure of 0.91, precision of 0.96, and recall of 0.87, outcomes comparable to results from a supervised ER process. In samples with poor data quality, whether mixed or single layout, the average F-measure was 0.78, precision 0.74, and recall 0.83, showing that data quality assessment and improvement is still a critical component of successful data curation. The results demonstrate the feasibility of building an unsupervised ER engine to support data integration for good quality references while avoiding the time and effort to standardize reference sources to a common layout, design and test matching rules, design blocking keys, or test blocking alignment. The paper also proposes how unsupervised data quality improvement processes could be incorporated into the design, allowing the model to address an even broader range of data curation applications.

Keywords—Unsupervised entity resolution; data curation; frequency blocking; entropy regulated; data washing machine


I. INTRODUCTION
As organizations ingest and process larger amounts of data, the time and effort it takes to prepare and integrate data into useful products are also increasing, and many researchers are working to alleviate this bottleneck using several different approaches [1], [2], [3]. The root cause of the time delay is human supervision of the curation steps, including data quality analysis, data cleansing and standardization, entity resolution (ER), and data integration [4]. The goal of ER is to link two references if, and only if, the references are equivalent [5], [6]. The problem is only exacerbated by Big Data [7], [8]. Because of the time delay between receiving data and its availability for use, data analysts often face the choice of waiting for the preparation to be complete or bypassing the curation process and engaging in their own attempts at data preparation, which may or may not follow best practices.
Many organizations are beginning to recognize this time and effort gap between data ingestion and the final information product, and are moving to remedy this situation by increasing the level of automation in data curation processes [9]. These organizations, along with software vendors and university researchers, are trying to understand how to apply the same AI and ML techniques used for the analytics at the end of the pipeline to the automation of the preceding data preparation processes [10], [11]. While many of these approaches employ AI and ML [12], [3], [13], they still largely rely on some level of standardization in the source data. The ultimate goal is to develop systems for unsupervised data curation (UDC) which are metadata agnostic and can directly ingest and process raw data. The objective of UDC is to develop methods and techniques to process data at scale and successfully produce information products without manual intervention. Key components of the data curation process and prime targets for automation are the largely manual processes of data quality analysis, building transformations for data cleansing and standardization, and developing and testing rules for entity resolution and data integration (fusion).
UDC has been likened to a "data washing machine" [14]. When using a household washing machine for laundry, the user first loads the dirty laundry and detergent, then selects the cycles. The washing machine automatically executes the cycles and, in the end, produces clean laundry. Similarly, the user of the data washing machine loads dirty data with appropriate reference data, then selects the data cycles (control parameters). The data washing machine then executes the cycles to produce clean data (an information product) appropriate for use in a particular application.
The focus of this research is to describe a proof-of-concept (POC) prototype to serve as both a starting point and a foundation upon which a more complete UDC can be built [15]. The primary goal is to develop unsupervised methods and techniques for both data cleaning and data integration (ER) capable of operating at scale. The current code for the POC described in this paper can be found at https://bitbucket.org/Awaad_Al_Sarkhi/dwm-datawashingmachine/src/master/

II. A PROOF-OF-CONCEPT (POC) FOR UNSUPERVISED DATA CURATION (UDC)
The purpose of the POC is to demonstrate the feasibility of cleaning and integrating entity references in an automated fashion for certain types of data and certain phases of the curation process. The primary use case addressed by the POC is "multiple sources of the same information," described in [16] as one of ten root causes of data quality problems. The novelty of the POC is that it attempts to perform unsupervised entity resolution (ER) first rather than data cleaning, the opposite of most supervised processes. The objective of the POC is to minimize the human intervention needed to analyze and transform the data and still obtain usable results as measured by the accuracy of clustering, i.e. a working data washing machine for data deduplication.
The POC for the data washing machine was written in Python and Java and uses frequency-based blocking, a multi-token scoring matrix as its ER matching process, and entropy-based quality evaluation of clustering [17], [18], [19], [20]. The assumptions of the POC are:
• The input to the process is a text file in a comma-separated values (CSV) format.
• Each text line is a reference to the same type of entity such as person entities (patients, customers, students), business entities, or materials (product listings, machine parts).
• The references are not assumed to be standardized with a uniform metadata tagging. No metadata is used in the POC process. Any metadata in the form of a header record is discarded.
• The first string value in each text line is a unique reference identifier.
To facilitate experimentation with various unsupervised techniques, the POC was developed as a series of sub-processes or phases. Currently, three phases have been implemented, and the fourth phase, for token correction, is under development. The organization of this paper is as follows:
• Phase I: Punctuation removal, upper casing, and tokenization.
• Phase II: Global standardization (replacement) of non-numeric tokens at the file level.
• Phase III: Removal of stop words, blocking, and clustering of equivalent references (entity resolution).

A. Phase I - Tokenization
The first Phase reads each reference as a line of text and performs a series of operations. The first is to separate the reference identifier, convert all letters to uppercase, and replace the field delimiters (typically a comma) with a blank character. Next, all non-word characters (\W) are replaced. For experimentation, two methods of replacement for non-word characters were tried. In the first method, called "Compress," the non-word characters are replaced by a null character. For example, if a field has the value "123-456", then after replacing the hyphen character with a null character it becomes the single string "123456". In the second method, called "Splitter," each non-word character is replaced by a blank character. The same example "123-456" becomes two strings (tokens), "123" and "456".
The motivation for the Compress method was to transform characteristic values with punctuation such as telephone numbers and dates into a single string. Interestingly, for the data used for the initial validation of the POC, the Splitter method generally gave better results than the Compress method.
In addition to non-word character replacement, upper casing, and tokenization, the first Phase also has an option to de-duplicate tokens. If the duplicate token option is employed, any duplicates of tokens within the same reference are removed, otherwise, duplicates are left in the reference. In the end, the cleaned tokens from each reference are reassembled into a blank delimited string and written to the tokenized reference file.
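To make the two replacement methods concrete, the following is a minimal Python sketch of the Phase I operations described above. The function name, its signature, and the per-field handling of the Compress method are illustrative assumptions of this sketch, not the POC's actual code.

import re

def phase1_tokenize(line, method="Splitter", dedupe=True):
    # Illustrative sketch of Phase I: separate the reference identifier,
    # upper-case the body, replace non-word characters, and tokenize.
    ref_id, _, body = line.rstrip("\n").partition(",")        # first value is the identifier
    fields = [f.strip().upper() for f in body.split(",")]     # field delimiters become breaks
    tokens = []
    for field in fields:
        if method == "Compress":
            # punctuation removed within a field: "123-456" -> "123456"
            # (keeping blanks between words inside a field is an assumption of this sketch)
            tokens.extend(re.sub(r"[^\w\s]", "", field).split())
        else:  # "Splitter"
            # each non-word character replaced by a blank: "123-456" -> "123", "456"
            tokens.extend(re.sub(r"\W", " ", field).split())
    if dedupe:
        tokens = list(dict.fromkeys(tokens))                  # optional de-duplication, order kept
    return ref_id, " ".join(tokens)                           # blank-delimited tokenized reference

# Example: phase1_tokenize("R13,John Doe,Oak St,Anyville AL,793-1234")
# returns ('R13', 'JOHN DOE OAK ST ANYVILLE AL 793 1234')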

B. Phase II - Global Token Replacement
Phase II attempts an unsupervised correction of misspelled tokens based on token frequency and string similarity. The replacement rests on the assumption that if a high-frequency, non-numeric token is very similar to a low-frequency, non-numeric token, the low-frequency token is likely to be a misspelling of the high-frequency token and can be replaced by the high-frequency token. The validity of this assumption depends upon several factors, including what counts as a high frequency, what counts as a low frequency, and what counts as very similar.
The process is controlled by four parameters:
• MinFreqStdToken - The minimum frequency of a token that can be used to replace another token, i.e. one that can function as a "standard" token.
• MinLenStdToken - The minimum string length of a standard token.
• MaxFreqErrToken - The maximum frequency of a token that can be replaced by a standard token, i.e. one that can be treated as an "error" token.
• MaxStringDist - The maximum string (character) distance between a standard token and an error token for the error token to be replaced (usually 1, as measured by Levenshtein edit distance).
The replacement table has a one-to-many relationship between standard tokens and error tokens. One standard token can replace many different error tokens, but an error token can only be replaced by one standard token. Table I shows some examples of token replacements generated in Phase II for Sample S8. It is important to note the token changes made in Phase II are not permanent changes to the source data; they are intended only to improve the clustering (ER) results in Phase III. Research is continuing on the development of Phase IV to make more accurate token corrections (standardization) at the cluster level. If it can be demonstrated that the clusters produced by Phase III are reasonably accurate, then the criteria for identifying misspellings described for Phase II can be applied more aggressively at the cluster level than at the file level. For example, while the replacements shown in Table I risk overwriting valid tokens at the file level, the same replacements are more likely to be valid within a cluster of references believed to be for the same person. Changes at the cluster level could also be applied to numeric tokens. For example, if five out of six references in a cluster have the token "413", the sixth reference has "431", and all six instances are preceded and followed by the same token, then it is not unreasonable to assume "431" is a mistyped version of "413".
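As an illustration of the Phase II rule, the following Python sketch builds a replacement table from a token frequency dictionary (for example, a collections.Counter over all tokens in the file). The function names and the default parameter values are illustrative only, and the edit_distance helper is a standard Levenshtein implementation standing in for whatever the POC uses internally.

def edit_distance(a, b):
    # Standard Levenshtein edit distance (dynamic programming).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def build_replacement_table(token_freq,
                            min_freq_std=10, min_len_std=4,
                            max_freq_err=2, max_dist=1):
    # token_freq: dict of token -> frequency built over the whole file.
    # Rule: a low-frequency, non-numeric "error" token within max_dist edits of a
    # high-frequency "standard" token is mapped to that standard token.
    standards = [t for t, f in token_freq.items()
                 if f >= min_freq_std and len(t) >= min_len_std and not t.isdigit()]
    errors = [t for t, f in token_freq.items()
              if f <= max_freq_err and not t.isdigit()]
    table = {}                                    # error token -> one standard token
    for err in errors:
        for std in standards:
            if edit_distance(err, std) <= max_dist:
                table[err] = std                  # one standard token may replace many
                break                             # error tokens, but not the reverse
    return table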

C. Phase III - Clustering (ER)
The purpose of Phase III is to cluster records for the same entity in support of data deduplication and data integration. This phase is more complex than Phases I and II and involves iterating over the tokenized source records coming out of Phase II. The clustering phase is a series of 13 processes labeled P1 through P13.

1) Process P1: Tokenize and compute token frequencies:
Because the references have already been tokenized in Phase I, the re-tokenization here is simply a matter of separating the reference identifier and splitting the remaining substring by blank (white) space. While computing token frequencies is redundant with the same process in Phase II, for experimental purposes this was done to make Phase II an optional process allowing the evaluation of data integration results with and without token replacement.
2) Process P2: Tokenizing references and appending blocking tokens:
Process P2 is the start of an iterative process on the "reprocess file". Initially, the reprocess file is a copy of the original input file from which the frequency dictionary was created in Process P1. However, as the POC progresses, the reprocess file becomes a smaller and smaller subset of the original input source until there are no more references to process, ending the iterations.
Process P2 repeats the tokenization process described in Process P1, in which each reference is split into a list of tokens. However, Process P2 has access to the token frequency dictionary previously built in P1. Process P2 has two primary functions:
• To rebuild each input reference as a string of blank-separated tokens, omitting all tokens found to have a frequency above the stop word frequency threshold (σ), creating "skinny references."
• To output a copy of the skinny reference for each blocking token found in the reference.
Again, a blocking token is simply any token with a frequency below the blocking frequency threshold β. This means the output from P2 will have more records than the input, assuming almost all references have at least one blocking token, and many have more than one.
Example: Suppose an input reference has the form

R13, John Doe, Oak St, Anyville AL, 793-1234

The tokenization of this reference would produce 9 tokens: "R13", "JOHN", "DOE", "OAK", "ST", "ANYVILLE", "AL", "793", and "1234" (using Splitter tokenization). Also, suppose the tokens "JOHN", "DOE", and "OAK" have a frequency below β, and the tokens "AL", "ST", and "793" have a frequency above σ. Because the input reference R13 contains three blocking tokens, P2 will output three skinny references, one for each blocking token. To simplify parsing, each output reference is divided into three segments separated by the colon (:) character: the first segment is the reference identifier, the second the blocking token, and the third the body of the reference.
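The following is a minimal sketch of Process P2 under the assumptions above. The function name, the representation of tokenized references as (identifier, body) pairs, and the exact threshold comparisons (at most σ for keeping a token, below β for a blocking token) are illustrative choices of this sketch, not the POC's actual code.

def process_p2(tokenized_refs, token_freq, sigma, beta):
    # tokenized_refs: list of (ref_id, body) pairs, e.g. ('R13', 'JOHN DOE OAK ST ...').
    # Drop stop words (frequency above sigma) to form the "skinny reference", then emit
    # one copy of the skinny reference for every blocking token (frequency below beta).
    out = []
    for ref_id, body in tokenized_refs:
        tokens = body.split()
        skinny = [t for t in tokens if token_freq[t] <= sigma]    # remove stop words
        blocking = [t for t in skinny if token_freq[t] < beta]    # blocking tokens
        for b in blocking:
            out.append(f"{ref_id}:{b}:{' '.join(skinny)}")        # three colon-delimited segments
    return out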

3) Process P3: Sorting by blocking tokens to create blocks:
The purpose of Process P3 is to sort the output references from Process P2 into ascending order by blocking token (Segment 2 of the rebuilt references). Each sequence of consecutive references with the same blocking token forms a block for input to the ER record-linking process.

4) Process P4: Iterate blocks:
Process P4 is the start of an iterative process (P5) to be performed on each block. P4's primary function is to detect sequences of consecutive records having the same blocking token, then pass each block of references on to P5.

5) Process P5: Link reference pairs in blocks:
In Process P5, each block undergoes a process to generate pairs of linked references. The technique implemented in the POC is a multi-token comparator. Every pair of references in the block is compared; for a block of N references, there will be N(N-1)/2 pairs. Any pairwise matching process can be inserted at this point, including machine learning (ML) algorithms for linking. Because the entity references are text, such an approach usually requires an additional process to convert references from text to numeric vectors, a process called text embedding. Some results from using the DBScan clustering algorithm with doc2vec text embedding are shown in this paper (Table III).
Most of the work described here used the scoring matrix, in this case a variation of the Monge-Elkan method [21] for comparing multi-token values, but with the removal of stop words. When the scoring matrix processes a pair of references, each reference is first transformed into a list of tokens (words), then the stop word tokens are removed from the list. The remaining tokens from the first reference are used to label the rows of the matrix, and the remaining tokens from the second reference label the columns of the matrix. Each cell value of the matrix is a normalized similarity measure, i.e., a value in the interval [0,1], between the two tokens. In the POC, the normalized Damerau-Levenshtein Edit Distance (nLED) function was used.
To illustrate the operation of the scoring matrix, consider the following two references:

A045, Smith, John, Apt 21, 345 Oak St, Anytown, NY
B167, Jon Smith, 345 Oak Street #21, Anytown, NY

Furthermore, suppose the threshold for the comparator has been set to 0.80, and the list of stop words contains the token "NY." The resulting token matrix would then appear as shown in Fig. 1. The process begins by finding the largest similarity value in the matrix. This value becomes the initial value of a running total. After the largest similarity value is used to initialize the total, all of the values in its row and column are removed (set to zero). In the next iteration, the largest similarity value among the remaining values in the matrix is identified and added to the running total.
Again, all of the nLED values in the same row and column as the largest value are removed. The process continues in subsequent iterations until all of the similarity values have been removed from the matrix. In Fig. 1, the cells with underlined and bold font are the surviving similarity scores from this process. After the last iteration, the running total is divided by the number of iterations. If the calculated average value is greater than or equal to a threshold value provided by the user, then the references are linked. Fig. 1 also shows the final matrix score for this pair of references.

6) Process P6: Linked pair generation:
The purpose of Process P6 is to form the graph edges between pairs of references in the same cluster. Because the clusters are all formed from references in a single-token block, they only represent the connections found between references sharing the token forming the block.

7) Process P7: Post-resolution transitive closure:
Unlike traditional match-key blocking, frequency-based blocking does not produce a true partition of the input references in which each input reference is in one, and only one, block. In frequency-based blocking, each reference is replicated by the number of blocking tokens it contains, as in the example for Process P3. To create the final set of clusters, in which each reference occurs in one, and only one, cluster, the clusters created from the blocks must be merged and undergo a transitive closure process.
The POC implements a very efficient sorting closure process described by Kolb et al [22]. While the sorting transitive closure is implemented in the POC as an in-memory, Java application, the algorithm is a highly-scalable, map/reduce process for execution in the Hadoop Distributed File System (HDFS) environment.
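Returning to Process P5, the following sketch shows one way the scoring-matrix comparator described above could be implemented. It reuses the edit_distance helper from the Phase II sketch, substitutes plain Levenshtein similarity for the Damerau-Levenshtein variant used in the POC, and follows the greedy row/column retirement and final averaging described above; it is a sketch under those assumptions, not the POC's actual code.

# edit_distance() is the Levenshtein helper defined in the Phase II sketch above.

def nled(a, b):
    # Normalized edit-distance similarity in [0, 1]; plain Levenshtein stands in here
    # for the normalized Damerau-Levenshtein (nLED) function used in the POC.
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

def scoring_matrix_score(tokens_a, tokens_b, stop_words):
    # Label rows and columns with the non-stop-word tokens of the two references,
    # greedily pick the largest remaining similarity, retire its row and column,
    # and average the picked values.
    rows = [t for t in tokens_a if t not in stop_words]
    cols = [t for t in tokens_b if t not in stop_words]
    if not rows or not cols:
        return 0.0
    matrix = [[nled(r, c) for c in cols] for r in rows]
    used_rows, used_cols = set(), set()
    total, picks = 0.0, 0
    while len(used_rows) < len(rows) and len(used_cols) < len(cols):
        best, bi, bj = -1.0, -1, -1
        for i in range(len(rows)):
            if i in used_rows:
                continue
            for j in range(len(cols)):
                if j in used_cols:
                    continue
                if matrix[i][j] > best:
                    best, bi, bj = matrix[i][j], i, j
        used_rows.add(bi)
        used_cols.add(bj)
        total += best
        picks += 1
    return total / picks   # the pair is linked if this average meets the match threshold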

8) Process P8: Iterate clusters:
Process P8 transforms the transitive closure output into clusters of linked references. Because the output of the sorting closure process is already in sorted order by cluster identifiers, the clusters are simply groups of consecutive references with the same cluster identifier.
9) Process P9: Entropy calculation:
Process P9 uses a variation of the Shannon entropy calculation [23] to assess the level of organization in each cluster of two or more references. The entropy of a cluster is calculated as

E = -Σ_j p(t_j) log2 p(t_j)    (2)

where t_j is the j-th vertical token group in the cluster, and p(t_j) is the probability of t_j.
For the POC, a vertical token group is defined to be the same token counted only once in each reference of a cluster. Thinking of the cluster as a matrix where the references are the rows and the columns are the tokens, a vertical token group is a vertical grouping of the same token across different references. However, each token is counted only once in each reference. This means the maximum size of a vertical token group is equal to the number of references in the cluster. The probability of a vertical token group is the size of the token group divided by the number of references in the cluster. For example, consider the following cluster of 3 references. The first vertical token group is for the token "JOHN" which only occurs once in R1, forming a vertical token group of size 1 with a group probability of 1/3. The second vertical token group is for "GRANT" which has 3 tokens, one token each from R1, R2, and R3, giving this group a probability of 1.0 (3/3). The second "GRANT" in R1 is not part of this token group because each token is only counted once in each reference. The token group for "123" has a probability of 1/3, the second "GRANT" group has a probability of 1/3, and the "ST" group a probability of 2/3.
After exhausting all of the tokens in R1, there are still four uncounted tokens in R2, forming the "MARY" group with probability 2/3, the "21" group with probability 2/3, the "OAK" group with probability 2/3, and the "STREET" group with probability 1/3. Finally, there are no remaining uncounted tokens in R3. In total, there are 9 vertical token groups in the example cluster, and the total entropy of the cluster is calculated from these group probabilities using Formula (2). Entropy is a measure of the organization of a cluster in terms of having similar tokens [24]. The entropy of a cluster decreases as the references in a cluster have more and more similar tokens. By this measure, a cluster will have an entropy of 0 if, and only if, all of the references have the same tokens.
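A sketch of the vertical-token-group entropy calculation is shown below. It assumes each reference in the cluster is given as a list of tokens, forms one group per occurrence index of a token (so a token repeated within a reference starts a second group, as in the example above), and uses the standard -p·log2 p weighting of Formula (2); the POC's exact variation may differ in detail.

import math
from collections import Counter

def cluster_entropy(cluster):
    # cluster: list of references, each a list of tokens.
    # The k-th occurrence of a token within a reference joins the k-th vertical group
    # for that token, so each token is counted at most once per reference per group.
    n = len(cluster)
    group_sizes = Counter()
    for tokens in cluster:
        occurrence = Counter()
        for t in tokens:
            occurrence[t] += 1
            group_sizes[(t, occurrence[t])] += 1
    entropy = 0.0
    for size in group_sizes.values():
        p = size / n                     # probability of the vertical token group
        entropy -= p * math.log2(p)      # Formula (2); the POC's exact weighting may differ
    return entropy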

10) Process P10: Assessment of clusters based on entropy:
In Process P10, the entropy of each cluster as calculated in Process P9 is assessed against the user-defined entropy threshold ε. If a cluster has an entropy less than ε, it is judged to be an acceptable cluster, and the reference identifiers and cluster identifiers from the cluster are written to the Saved Clusters output file. Otherwise, the cluster identifiers are discarded, and the references are written to the Reprocess file. References written to the Reprocess file will go through the entire blocking and ER process again, but at a higher match threshold. By definition, all clusters of size one (singleton clusters) have an entropy of 0 and are written directly to the Saved Clusters file.
For each cycle of the POC, the size of the Saved Clusters file increases while the size of the Reprocess file decreases. The Reprocess file will eventually become empty as the match threshold μ approaches 1.0. At very high match thresholds, the references in a block can only form clusters if they are highly similar and generate clusters of very low entropy; otherwise, they break down into singleton clusters. In either case, they will eventually pass to the Saved Clusters file, and the Reprocess file will be empty. An example of this process is shown in Table I. The statistics are produced as part of the statistics report when running a sample. In this case, the statistics are for Sample S4 of 1,912 references. As shown in Table IV, the parameters for this run were β=12, σ=22, ε=4.2, and a starting value of μ=0.5.
The volume of work continually decreases with each iteration. Note that some references written to the reprocess file will not be used in the next iteration. This is because, at the beginning of the next iteration, the reprocess file is re-blocked and re-clustered. During the clustering process, reference-to-reference links are only produced for references linked to at least one other reference. For example, 27 references were written to the reprocess file at the end of the μ=0.8 iteration, but only 14 of these references survived to form 6 clusters of two or more references when the match threshold μ was increased to 0.90.

11) Process P11: Reprocess decision:
As described in Process P10, at some point the Reprocess file will be empty. When this happens, the reprocessing cycle stops, and the final join (Process P13) is performed.
12) Process P12: Increasing Match Threshold: If there are references to be reprocessed, then the match threshold is increased before the reprocess is started. Increasing the match threshold will require references to be more similar before they are linked into the same cluster. In all of the results reported here, the increment value was 0.1 (10%).
13) Process P13: Final join to original source:
Although no further iterations are necessary when the Reprocess file is empty, there are still two tasks to complete. The first task is to ensure every reference in the source is represented in the final set of clusters. Some references in the source may not be transferred to the Saved Clusters file: depending upon the value of the blocking frequency threshold β, some references may not contain any blocking tokens and are never output from Process P2.
The second task is to append the final cluster identifier to each reference in the source. The goal is to create a Final output comprising every reference in the source along with its proper cluster identifier. Both of these tasks can be completed by performing an outer join by reference identifiers between the original Reference Source file and the Saved Clusters file.
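Taken together, Processes P10 through P13 form the control loop of the POC. The following high-level sketch shows that loop. The helper block_and_cluster(), which stands in for Processes P2 through P9 and is assumed to return (cluster id, entropy, member references) triples, and the reference objects with a ref_id attribute, are illustrative assumptions of this sketch.

def washing_machine_loop(references, block_and_cluster, beta, sigma, epsilon,
                         mu_start=0.5, mu_step=0.1):
    # references: objects with a ref_id attribute.
    # block_and_cluster: assumed helper standing in for Processes P2-P9; returns
    # (cluster_id, entropy, member_references) triples for one pass at threshold mu.
    saved = {}                              # reference id -> accepted cluster id
    reprocess = list(references)
    mu = mu_start
    while reprocess and mu <= 1.0:
        next_round = []
        for cluster_id, entropy, members in block_and_cluster(reprocess, beta, sigma, mu):
            if len(members) == 1 or entropy <= epsilon:   # P10: singletons have entropy 0
                for ref in members:
                    saved[ref.ref_id] = cluster_id        # accept the cluster
            else:
                next_round.extend(members)                # P10: send back for re-linking
        reprocess = next_round
        mu += mu_step                                     # P12: raise the match threshold
    # P13: outer join so every source reference carries a cluster identifier; references
    # never clustered (e.g., no blocking token) keep their own identifier.
    return {ref.ref_id: saved.get(ref.ref_id, ref.ref_id) for ref in references}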

D. Cluster Cleaning
While this process has not been implemented in the POC, work is currently underway to develop unsupervised techniques for cleaning and standardizing tokens within the same cluster. The current approach is very much the same as the Global Token Replacement described in Phase II. However, replacements can be more aggressive at the cluster level than at the file level. Across an entire reference file, there could easily be an entire sequence of house numbers, such as 123, 124, 125, and so on. For this reason, numeric tokens are specifically excluded from replacement globally. However, at the cluster level, it is much more probable that if 5 of 6 references have the token 123 and the sixth reference has 124, the replacement of 124 by 123 would be a correction.

III. POC TEST SAMPLES AND RESULTS
To test the POC, 18 samples were taken from four fully annotated reference sources. Aside from having equivalent references to match, the samples also exhibited combinations of two other characteristics: high data quality (DQ=Good) versus low data quality (DQ=Poor), and uniform record layout (Mixed=No) versus mixed record layout (Mixed=Yes). In all cases, the Splitter tokenization method with token de-duplication was used in Phase I. To gauge the effect of Phase II (Global Token Replacement), each sample was run with, and without, the global token replacement. In the cases where token replacement was run, the parameters were fixed at the settings described in the section on Phase II. To establish a baseline, all samples were run with 0.50 as the initial value of μ and 0.10 as the increment value for μ. The initial values for β and σ were set using the linear regression prediction formulas (3) and (4) for the non-iterative model [25]. However, the actual values for β, σ, and ε were set manually by observing the correlation between the F-measure of each cluster and the computed entropy as logged by the system (Table IV), and then exploring a range of values around these estimates using a grid search, automated with a robotic Python process that ran each range of settings and collected the precision, recall, and F-measure results. The results for all 18 samples using the best parameter settings are given in Table IV.
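The grid search itself can be a very small piece of code. The sketch below illustrates the idea, assuming a hypothetical run_dwm() helper that executes the full POC on a sample with one parameter setting and returns precision, recall, and F-measure against the annotated truth set; neither the helper nor the ranges correspond to the actual robotic process used in this research.

import itertools

def grid_search(sample, beta_range, sigma_range, eps_range, run_dwm):
    # run_dwm: hypothetical helper that runs the full POC on the sample with one
    # parameter setting and returns (precision, recall, f_measure) against the truth set.
    best = None
    for beta, sigma, eps in itertools.product(beta_range, sigma_range, eps_range):
        precision, recall, f = run_dwm(sample, beta=beta, sigma=sigma, epsilon=eps,
                                       mu_start=0.5, mu_step=0.1)
        if best is None or f > best[0]:
            best = (f, precision, recall, beta, sigma, eps)
    return best   # best F-measure and the parameter setting that produced it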

A. Samples with Good Data Quality
Stratified samples S1, S2, S4, S5, S7, S13, S14, and S15 were drawn from a corpus of approximately 800K references created using the R-package "generator", and degraded with data quality errors using the R-package "relErrorGeneratoR" from GitHub.com. While some reference-level errors such as misspelling, truncation, mixed formatting, and missing values were injected into the data during generation, the individual references in the 800K corpus are of relatively high quality.
The majority of the data quality errors introduced into the 800K corpus were data redundancy (duplicate record) errors to make the corpus more useful for entity resolution research. Shown here are two references from Sample S4 with Record Layout A. The only variations between the two references are the name truncation (initial) and different formats for telephone numbers and identification numbers. Sample S6 was produced by the GeCo synthetic data generator [26], and Sample S3 is a file of 866 references to restaurants (businesses) from two public sources, Zagat's and Fodor's restaurant guides. The references contain restaurant names, addresses, city, phone, and type of cuisine. The file has been manually annotated and is known to have 112 pairs of equivalent references [27]. Examples of references from S3 are shown here.

B. Low Data Quality Samples
Samples S8, S9, S10, S11, S12, S16, S17, and S18 were taken from the SOG (Synthetic Occupancy Generator) project [28]. The SOG corpus has approximately 270K references with three different record layouts, A, B, and C. The SOG corpus has a much higher level of data quality errors than the 800K corpus. Most records exhibit at least one error such as a missing value, misspelling, truncation, inconsistent formatting, nicknames, or name and address changes. Shown here are three equivalent references from Sample S8 exhibiting a number of these data quality issues.

C. Mixed Layout Samples
In addition to variations in quality, Sample S7 and Samples S10 through S18 were selected with mixed (heterogeneous) record layouts. For example, in Sample S7 about half of the references were in Record Layout A and the other half in Record Layout B. The two layouts use a different order for names and have different identity attributes, e.g. social security number in Layout A and date-of-birth in Layout B. An example of a pair of references from S7 is shown here.
As noted, the values for β, σ, ε, and the starting value of μ were found by a grid search. Prior research using the scoring matrix for ER on samples from these same corpora [25], [29], [30], [31] provided some guidance on the best values for the blocking frequency threshold β and the stop word frequency threshold σ based on the size of the sample and the standard deviation of its token frequency distribution. However, the previous research did not involve the entropy-based self-evaluation or the iteration with incrementally increasing match thresholds used in this research, and thus it did not provide any guidance about the best setting for the entropy threshold ε. Instead, the estimated value for ε was found by observing the entropy measure of each cluster and comparing it to the actual F-measure of the cluster. This was possible because all of the test samples were fully annotated. The F-measure assessment of each cluster was an augmentation to Process P5. As the entropy of each cluster is calculated, the cluster is also sent to an ER metrics program to determine the actual F-measure of the cluster as compared to the annotated truth set.
The entropy and the actual F-measure of each cluster were captured in a Cluster Analysis text file. Table II shows a segment of the report produced when running Sample S2. The table shows the results of three iterations. Row 1 of Table II shows the last cluster produced at the initial value of μ = 0.5, Rows 2 through 6 show the entire second reprocess iteration of four clusters where the value of μ was 0.6, and Row 7 is the first cluster of the last iteration where μ was 0.7. As each cluster is formed, its entropy is calculated as shown in the column labeled "Entropy." If the cluster's entropy is above the value of ε (set at 4.3 for this run), the cluster is judged to be "bad" and is written to the reprocess file for re-linking at the next higher value of the match threshold μ.
On the other hand, if the entropy is less than or equal to ε, the cluster is written to the "good" file as a final cluster. Table II shows that for Rows 4 and 5 these were correct decisions: in both cases, the F-Measure was less than 1.0 when the entropy was above 4.3. However, Row 2 is an exception. Even though its entropy of 9.0446 is above 4.3, the cluster had an F-Measure of 1.0 and was correctly linked. Because the entropy was above the threshold, the references were nevertheless put back for reprocessing in the third iteration. In the end, the F-Measure for S2 was 0.8842, as shown in Table II.

TABLE II. SEGMENT OF THE CLUSTER ANALYSIS REPORT FOR SAMPLE S2

D. Example Results using Machine Learning for P5
In this example, Sample S4 was processed using DBScan (Density-Based Spatial Clustering of Applications with Noise) [32] as the ML clustering algorithm and using the doc2vec [33] word embedding algorithm to create the numeric vectors as input for DBScan. As implied by its name, the doc2vec algorithm converts an entire document into a vector. For the POC, each reference was considered a document so there is a one-to-one correspondence between each input reference and each vector clustered by DBScan.
The doc2vec algorithm was applied to each block using a fixed set of parameters. The DBScan algorithm was imported from the Python 3.7 scikit-learn library (sklearn.cluster). This version has two control parameters, "eps" and "min_samples". The eps parameter controls the neighborhood reach (proximity) of vectors to be in the same cluster, and the min_samples parameter defines the minimum size of "core samples", i.e. the minimum number of vectors within eps distance of each other. The results from using this configuration are shown in Table III.
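A minimal sketch of this configuration is shown below, assuming the gensim implementation of doc2vec and scikit-learn's DBSCAN. The parameter values, the cosine distance metric, and the representation of a block as (identifier, token list) pairs are placeholders of this sketch rather than the settings used in the reported runs.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import DBSCAN

def cluster_block_with_dbscan(block, vector_size=20, epochs=40, eps=0.3, min_samples=2):
    # block: list of (ref_id, token_list) pairs; each reference is one "document".
    # All parameter values are placeholders, not the settings used in the reported runs.
    docs = [TaggedDocument(words=tokens, tags=[ref_id]) for ref_id, tokens in block]
    model = Doc2Vec(docs, vector_size=vector_size, min_count=1, epochs=epochs)
    vectors = [model.infer_vector(tokens) for _, tokens in block]
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit(vectors).labels_
    return [(ref_id, label) for (ref_id, _), label in zip(block, labels)]   # -1 marks noise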

IV. CONCLUSION AND FUTURE RESEARCH
The results shown in Table IV suggest that entropy can be an effective way to regulate an unsupervised clustering process. The POC using the scoring matrix performs extremely well when processing good quality references such as Samples S1-S7 and S13-S15. The average F-Measure for these samples was 0.9124 with an average precision of 0.9609.
The average F-Measure for the poor quality samples S8-S12 and S16-S18 was somewhat lower, at 0.7772, with an average precision of 0.7351. The results also indicate the POC is more sensitive to data quality issues than to mixed record formats. The good-quality, mixed-format Samples S7, S13, S14, and S15 had an average F-Measure of 0.8866 compared to an average F-Measure of 0.9426 for good-quality, single-format samples.
For the good quality samples, where the clustering precision was 96%, the hope is that applying a more comprehensive cleaning and standardization at the cluster level will be able to provide much better results. The goal for future research is that, just as linking results can be continually improved through iterative reprocessing, the same reprocessing loop will also incorporate processes to continually improve the quality of the references, which in turn would further improve the linking results. The POC described in this paper shows that the unsupervised ER improvement part of this positive feedback loop is feasible. The next step will be to integrate additional unsupervised data quality improvement processes.

A. Industry Testing
As an experiment, a commercial company tested the POC (data washing machine) approach using a real-world dataset of 70,500 business names and address references with mixed record layouts. Because the dataset was not annotated, it was not possible to calculate the exact F-measure of the overall clustering results. However, the company did undertake an extensive manual review of the POC results in comparison to results from their standard process. The company determined the POC results to be as good as, and in many cases better than, the results from their standard process, but with the added advantage of avoiding the time and effort to analyze and prepare the data required by their standard process.
In addition, the company is experimenting with some variations of the original POC design described in this paper. In particular, they have been able to improve the clustering accuracy for their datasets by using a computed value for the entropy threshold ε. In their approach, they consider five factors when assessing the entropy of each cluster. These are:
• The match threshold μ used to form the cluster.
• The number of references in the cluster (size).
• The maximum number of tokens in any one reference in the cluster (maxT).
• The minimum number of tokens in any one reference in the cluster (minT).
• The average number of tokens for all references in the cluster (avgT).
In Process P10, instead of comparing the entropy of the cluster to a static value of ε as in the original POC, they compute a dynamic threshold based on the factors listed above; in particular, ε is computed as a function of these five factors. In another change, they were able to improve the precision of the clustering by modifying the scoring matrix used to link references in Process P5. The comparator was modified to use a Boolean similarity of match (1.0) and no-match (0.0) when comparing numeric tokens, while still using the normalized Damerau-Levenshtein edit distance when comparing non-numeric tokens.
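A sketch of this modified token comparator might look like the following, where nled() is the normalized edit-distance helper from the Process P5 sketch and isdigit() is used as a simple, assumed test for numeric tokens.

def token_similarity(a, b):
    # Numeric tokens: exact match only (1.0 or 0.0); other tokens: normalized edit
    # distance, here the nled() helper from the Process P5 sketch above.
    if a.isdigit() and b.isdigit():
        return 1.0 if a == b else 0.0
    return nled(a, b)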

B. Predicting Parameters and Scalability
However, there are still two gaps that must be bridged to make the POC a practical solution for most real-world use cases. The first is a reliable method for setting the optimal values of the key parameters β, σ, and ε. In a research environment using fully annotated references, these values can be found by simply observing where the best results are obtained based on comparisons with the correct linking. When working with real data, this is not generally possible. A practical unsupervised ER system needs a way to predict these parameters for a given set of input references. Creating such predictive models is still research in progress.
The second consideration is scalability. The current POC is implemented in a combination of Python and Java, and as written, it is not very scalable. The blocking and the stop word removal process can be combined with the token counting process to avoid the need for storing an in-memory token frequency table.
The POC can be converted to an HDFS Map/Reduce process. The references can easily be tokenized in the mapping process, which then reduces on the token. The reducer can then emit two kinds of key-value pairs for each token group. The first is (RefID, Token) where the token has a frequency below σ (not a stop word). The second is (Token, RefID) where the token has a frequency below β (a blocking token). Sorting and reducing the first pairs on RefID as the key will create the skinny references of Process P2, while sorting and reducing the second pairs on Token as the key will create the blocks. The join of these two outputs on RefID is the equivalent of creating and sorting the blocked file in Process P3. Next, the blocks can be mapped to distributed nodes for pairwise linking in parallel, with the assurance that no block will be larger than β. The outputs are the Process P6 Linked Pairs. The transitive closure of the pairs in Process P7 using the algorithm of Kolb et al. [22] is already an efficient map/reduce process. Process P8 then becomes a map of the clusters to parallel processing work nodes performing the entropy calculation (Process P9) and the triage of clusters (Process P10) into "good" and "bad" cluster outputs.
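The token-counting and blocking step of such a conversion could be sketched in Hadoop-streaming style as follows. The tab-delimited record format, the SKINNY/BLOCK output tags, and the assumption that σ and β are supplied to the reducer are illustrative choices of this sketch, not a specification of the actual map/reduce design.

import sys
import itertools

def mapper():
    # Map: emit one (token, ref_id) pair for every token of every tokenized reference.
    for line in sys.stdin:
        ref_id, *tokens = line.split()
        for t in tokens:
            print(f"{t}\t{ref_id}")

def reducer(sigma, beta):
    # Reduce on token (Hadoop delivers the pairs sorted by key): the group size is the
    # token frequency; emit (ref_id, token) pairs for skinny references when the token
    # is not a stop word, and (token, ref_id) pairs when it is a blocking token.
    rows = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for token, group in itertools.groupby(rows, key=lambda r: r[0]):
        ref_ids = [r[1] for r in group]
        freq = len(ref_ids)
        for ref_id in ref_ids:
            if freq < sigma:
                print(f"SKINNY\t{ref_id}\t{token}")
            if freq < beta:
                print(f"BLOCK\t{token}\t{ref_id}")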
The POC described here was built using the simplest of approaches, which could no doubt be dramatically improved through additional research and experimentation. This includes investigating different starting values for the match threshold μ, exploring the sensitivity of the results to the increment value currently fixed at 0.1, and building prediction models for these parameters. Another direction is investigating whether the results are improved by modifying the values of β, σ, or ε for each reprocessing iteration, and if so, how they should be modified to produce the best linking results.