Consolidated Definition of Digital Transformation by using Text Mining

Abstract: Digital transformation (DT) has become essential for the majority of organizations in both the public and private sectors. The term "digital transformation" has been used (and misused) so frequently that it is now somewhat ambiguous, and it has become imperative to give it some conceptual rigor. The objective of this study is to identify the major elements of digital transformation and to develop a proper definition of DT for the public and private sectors. For this purpose, 56 definitions of DT collected from the available literature were analyzed; we found that prior work extracted the elements of DT definitions manually. Text mining techniques (TF-IDF and Fp-tree) are therefore used to identify the major constituents and to consolidate them into generic DT definitions. The approach consists of five phases: 1) collecting and classifying DT definitions; 2) detecting synonyms; 3) extracting major elements (terms); 4) discussing and comparing DT elements; 5) formulating DT definitions for different business categories. An evaluation tool was also developed to assess how well the definitions found in the literature cover the DT elements and, as a validation, it was applied to the formulated definitions.


I. INTRODUCTION
In a world of emerging and continuous change, digital transformation (DT) has become a necessity for most organizations, in both the private and public sectors. The term "digital transformation" has been used in a broad sense to cover many ideas, which has led to widely divergent viewpoints. Few attempts have been made to define DT. Based on a review of 56 definitions, we identified two fundamental approaches to defining DT: one is based on the scope of the study [39][40][41][42], and the other is based on the perspective of interviewed experts, as in [38] for the private sector or [43] for the public sector. According to [1], the phrase "digital transformation" does not have a generally accepted definition. Without a proper definition of DT, proper assessment and proposition of DT solutions (framework, model, or architecture) are not possible. A recent study [1] made an effort to define digital transformation, but it had two limitations: a) it did not classify the prior definitions, and b) it extracted the DT elements manually (based on their frequency). To the best of our knowledge, no study has so far defined DT elements using text mining techniques. To go beyond these limitations, we propose a comprehensive approach that uses text mining algorithms to extract the DT elements objectively. We categorize the prior DT definitions into two groups: public sector and private sector. In this study, text mining is used to answer the research question: What are the key elements of the DT definitions in the public and private sectors, as well as in general (all definitions)? The rest of this paper is organized as follows: In Section II, the proposed approach is described. In Section III, the results of the text mining techniques are presented. In Section IV, the digital transformation elements are discussed. In Section V, definitions of DT are proposed. In Section VI, we present a tool to assess various DT definitions. Finally, the conclusion, limitations, and future work are presented.

II. PROPOSED APPROACH FOR DEFINING DT ELEMENTS
Our general approach for defining major elements in digital transformation definitions is outlined in Fig. 1. We will give a brief description of each phase as follows:

A. Phase One: Collecting and Classifying DT Definitions
The first phase gathers existing definitions from recent specialized academic literature as well as from the websites of specialized private companies such as IBM, Google, and Oracle. Our dataset included 56 definitions.

B. Phase Two: Replacing Synonyms
Analysis of the acquired dataset revealed bigrams and other n-grams (e.g., big data, business process, business model) that must be treated as a single block. In this phase, these n-grams are therefore joined into single tokens and replaced in the dataset; for example, "big data" is replaced with "bigdata". We also replace some phrases, such as "artificial intelligence" and "internet of things", with their abbreviations (AI, IoT).
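The replacement step can be illustrated with a minimal Python sketch. The mapping table and the function name `ligate_ngrams` are hypothetical; only the example phrases named above are included, and the exact token spellings used in the study may differ.

```python
import re

# Hypothetical replacement table built from the examples named in the text.
NGRAM_MAP = {
    "big data": "bigdata",
    "business process": "businessprocess",
    "business model": "businessmodel",
    "artificial intelligence": "AI",
    "internet of things": "IoT",
}

def ligate_ngrams(text, ngram_map=NGRAM_MAP):
    """Replace each known multi-word phrase with its single-token form."""
    # Longer phrases are handled first so a short phrase never pre-empts a longer one.
    for phrase in sorted(ngram_map, key=len, reverse=True):
        # Match the words separated by any whitespace, ignoring case.
        # Any suffix after the phrase (e.g. a plural "es") is left in place.
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in phrase.split())
        text = re.sub(pattern, ngram_map[phrase], text, flags=re.IGNORECASE)
    return text

print(ligate_ngrams("DT uses Big Data and the Internet of Things to change business processes."))
```

Because suffixes are kept, "business processes" becomes "businessprocesses"; folding such variants together is left to the later stemming step.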

C. Phase Three: Extracting Major DT Elements (Using Text Mining)
The main elements (terms) of the definitions of digital transformation are extracted from the 56 collected definitions using traditional text mining techniques. This requires proper preprocessing of the acquired text (tokenizing, removal of stop words, stemming, and case transformation). TF-IDF, the most widely used method in the literature [60], was used to identify the most frequently used terms, and according to [61] the Fp-tree algorithm gives good results for extracting association rules from text. Therefore, in this work we used TF-IDF (unigrams) to extract frequently occurring terms from the DT definitions and Fp-trees to extract associated DT elements.
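The term-extraction step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the RapidMiner configuration used in the study: the stop-word list is a tiny placeholder, stemming is omitted, the IDF variant (natural log plus one) is one common choice, and the toy definitions are invented for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real run would use a full list).
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "is", "in", "that", "by", "for", "its"}

def preprocess(text):
    """Lower-case, tokenize on letters, and drop stop words (stemming omitted)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def tfidf_scores(documents):
    """Return {term: TF-IDF summed over all documents} for unigram terms."""
    docs = [preprocess(d) for d in documents]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:
        if not doc:
            continue
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            # Assumed IDF variant: ln(N / df) + 1, so ubiquitous terms keep some weight.
            idf = math.log(n_docs / df[term]) + 1.0
            scores[term] += tf * idf
    return dict(scores)

# Toy definitions, invented for illustration (not taken from the dataset).
toy = [
    "DT is the use of digitaltechnology to change the businessmodel",
    "DT is a change of businessprocess enabled by digitaltechnology",
    "digitaltechnology improves services for citizens",
]
ranked = sorted(tfidf_scores(toy).items(), key=lambda kv: kv[1], reverse=True)
print(ranked[:3])
```

Ranking by the summed score is one way to obtain a list of candidate elements per category; other aggregations (mean or maximum per term) are equally plausible.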

III. RESULTS OF TEXT MINING TECHNIQUES (RESULTS OF PHASE THREE)
The following results were obtained on a Dell laptop with a Core i7 processor running Windows 10. The approach was implemented using Python 3.7.4 and RapidMiner Studio 9.10.1. Table I shows the experimental parameters for the text mining algorithms. The confidence used for the Fp-tree is 1 in all DT categories.
The results of applying the TF-IDF and Fp-tree algorithms to the DT definitions are discussed below:

B. Method 2: Applying Association Rules
In this subsection, we apply association rules (Fp-tree) to each category of DT definitions as follows:
• The main elements in the public definitions.
The results of running the Fp-growth algorithm are shown in Table V. The final set contains three words that appear to be associated with one another: businessmodel, businessprocess, and digitaltechnology.
• The main elements in the private definitions.
The results of running the Fp-growth algorithm are shown in Table VI. The final set contains three words that appear to be associated with one another: customerexperience, businessmodel, and businessprocesses.
• The main elements in general DT definitions (all definitions).
The results of running the Fp-growth algorithm are shown in Table VII. The final set contains four words that appear to be associated with one another: businessmodel, businessprocess, digitaltechnology, and digital.
Based on the results of applying TF-IDF and Fp-tree to the DT definitions, we note that the most important technologies in the private definitions are IoT and cloud computing, compared to data mining and big data in the public definitions. We also found that definitions in the private sector focus on business, the customer, and innovation, while those in the public sector focus on services, government, and citizens. The intersecting elements between the DT definition categories are shown in Fig. 2. In general, digitaltechnology, digital, businessmodel, and businessprocess are intersecting elements across all DT definition categories, which suggests that they are the minimum elements needed to define DT. All intersecting elements originated from method one (TF-IDF), except the rule (BusinessProcess -> BusinessModel), which was identified by method two (Fp-tree) in the intersection between the public and general definitions. We therefore did not draw a Venn diagram for the Fp-tree results.
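The kind of rule reported in this section, such as BusinessProcess -> BusinessModel at confidence 1, can be illustrated with a small sketch. Note that the code below is not FP-growth: it enumerates itemsets by brute force, which yields the same frequent itemsets and rules on small inputs but does not build an FP-tree. The function names, the support threshold, and the toy transactions are assumptions made for illustration.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Enumerate itemsets whose support (fraction of transactions) >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            support = sum(1 for t in transactions if set(combo) <= t) / n
            if support >= min_support:
                result[frozenset(combo)] = support
                found = True
        if not found:
            # No frequent itemset of this size implies none of any larger size.
            break
    return result

def association_rules(itemsets, min_confidence):
    """Return (antecedent, consequent, confidence) triples meeting min_confidence."""
    rules = []
    for itemset, support in itemsets.items():
        for k in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), k):
                antecedent = frozenset(antecedent)
                # Every subset of a frequent itemset is itself frequent, so the lookup is safe.
                confidence = support / itemsets[antecedent]
                if confidence >= min_confidence:
                    rules.append((set(antecedent), set(itemset - antecedent), confidence))
    return rules

# Toy transactions: each set holds the tokens of one invented definition.
toy_transactions = [
    {"businessprocess", "businessmodel", "digitaltechnology"},
    {"businessprocess", "businessmodel"},
    {"digitaltechnology", "services"},
    {"businessmodel", "digitaltechnology"},
]
itemsets = frequent_itemsets(toy_transactions, min_support=0.5)
rules = association_rules(itemsets, min_confidence=1.0)
print(rules)
```

With a confidence of 1, a rule survives only if every definition containing the antecedent also contains the consequent, which is why few rules remain at that setting.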

V. PHASE FIVE: PROPOSING DT DEFINITIONS
Based on the DT elements identified above, we define DT in three categories as follows:
• DT Definition in General.
DT is a process that leverages digital technologies to change an organization (government or business), its business model, and its business processes in order to create value for consumers (customers or citizens).

• DT Definition in Public.
DT is a process that leverages digital technologies (big data, data mining) to change government, business process, business model, services, and citizens.

• DT Definition in Private.
DT is a process that leverages digital technologies (big data, cloud computing, and IoT) to change an organization, its business model, and its business processes in order to create value for the customer.

VI. PROPOSING A TOOL TO ASSESS VARIOUS DT DEFINITIONS
As mentioned before, we have two main categories of DT definitions (public and private), which we combined to create a third category called "general". Each category contains a set of elements, as discussed above. In the following, we compute the percentage of elements covered by each definition in each category, as well as the percentage of elements missing from each definition. For this purpose, we developed Algorithm 1, which takes two inputs. The first input is a dataset of reference numbers and definitions (text). The second input is a dictionary whose keys are the categories (Public, Private, and General) and whose values are the DT elements for each category, held in two lists: List 0 contains words that appear in definitions in sequence as a block (obtained from TF-IDF), whereas List 1 contains words that appear in definitions in sequence but not as a block (obtained from the Fp-tree algorithm). The percentages of covered and missing DT elements are calculated in the following steps:

A. For Each Algorithm, the Percentage of Words Covered is Calculated in Each Definition as Follows
• TF-IDF
$$CTF_{i,c} = \frac{NW_{i,c}}{TW_{c}} \times 100 \qquad (1)$$
where CTF is the percentage of words covered in definition i (i = 1, ..., 59) in term frequency, c is the category (c ∈ {private, public, general}), NW is the number of words covered by definition i in category c, and TW is the total number of words in category c according to TF-IDF.
• FP-tree
$$CFP_{i,c} = \frac{NW_{i,c}}{TW_{c}} \times 100 \qquad (2)$$
where CFP is the percentage of words covered in definition i in Fp-tree, c is the category (c ∈ {private, public, general}), NW is the number of words covered by definition i in category c, and TW is the total number of words in category c according to the FP algorithm.

B. The Total Percentage of Words Covered is Calculated in Each Definition as Follows
Here, TC is the total percentage of covered words in each definition i, over both TF-IDF and Fp-tree.
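The coverage calculation for one definition in one category can be sketched as follows. Several details are assumptions rather than statements from the text: the function name `coverage_report` is hypothetical, each Fp-tree entry is modeled as a tuple of tokens that counts as covered only when all of its tokens occur in the definition, the total coverage is taken as the mean of the two percentages, and the missing percentage is taken as the complement of the total coverage.

```python
def coverage_report(definition_tokens, tf_terms, fp_rules):
    """Return coverage percentages for one definition in one category.

    tf_terms: List 0, single tokens obtained from TF-IDF.
    fp_rules: List 1, tuples of tokens obtained from the Fp-tree step (assumed shape).
    """
    tokens = set(definition_tokens)

    # CTF: share of the category's TF-IDF terms present in the definition.
    nw_tf = sum(1 for term in tf_terms if term in tokens)
    ctf = 100.0 * nw_tf / len(tf_terms) if tf_terms else 0.0

    # CFP: share of the category's Fp-tree associations whose tokens are all present.
    nw_fp = sum(1 for rule in fp_rules if set(rule) <= tokens)
    cfp = 100.0 * nw_fp / len(fp_rules) if fp_rules else 0.0

    # ASSUMPTION: total coverage TC is the mean of the two percentages.
    tc = (ctf + cfp) / 2.0
    # ASSUMPTION: the missing percentage MP is the complement of TC.
    mp = 100.0 - tc
    return {"CTF": ctf, "CFP": cfp, "TC": tc, "MP": mp}

# Toy inputs, invented for illustration.
definition = "dt is a process that leverages digitaltechnology to change businessmodel".split()
tf_terms = ["digitaltechnology", "businessmodel", "businessprocess", "change"]
fp_rules = [("businessprocess", "businessmodel")]
report = coverage_report(definition, tf_terms, fp_rules)
print(report)
```

Running the function once per category and per definition gives a table of percentages; one plausible way to classify a definition is then to pick the category with the highest total coverage, although the text does not spell out that rule.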

C. The Percentage of Words Missing in Each Definition is Calculated as Follows
$$MP_{i,c} = 100 - TC_{i,c} \qquad (4)$$
where MP is the percentage of words missing from definition i in category c.

VII. DISCUSSION OF RESULTS (APPLYING OUR TOOL)
• It has been found that 15% of definitions are classified as public definitions.
• It has been found that 5% of definitions are classified as general definitions.
• The definition that covered the most elements is [8], in general with (35.1%), in public with (23.85%), and in private with (29.65%).
• The definitions that covered the fewest elements are [15] in public with (3.35%) and [4] in private with (7.9%) and in general with (6.5%).
B. When Applying our Algorithm to the Private Category (21 Definitions), it has been found that
• The proposed algorithm agrees with (66.66%) of the classifications of private sector definitions and differs for the rest: (28.57%) were classified as public sector definitions and (4.76%) as the general category.
• The definition that covered the most elements is [34] with (28.95%).
• Only three definitions were found to cover elements in Fp-tree: [23], [26], and [36], as shown in Table VIII.
C. When Applying our Algorithm to the Public Category (15 Definitions), it has been found that
• The proposed algorithm agrees with (86.66%) of the classifications of the public definitions and disagrees with (13.33%), which were classified as private definitions.
• The definition that covered the fewest elements is [42] with (6.5%).
• The definition that covered the most elements is [47] with (26.5%).
• The definition that covered the most elements in TF is [47] with (53.3%), compared to [53] in Fp-tree with (7.7%).
D. When Applying our Algorithm to All Definitions, it has been found that
• The proposed algorithm agreed with 75% of the previous studies' classifications of DT definitions, private (21) and public (15), while disagreeing with 25%.
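As a quick arithmetic check, the overall agreement figure can be reproduced from the per-category figures. The counts of agreeing definitions below (14 and 13) are not stated in the text; they are inferred from the reported percentages and category sizes and should be read as assumptions.

```python
# Category sizes as reported: 21 private and 15 public definitions.
private_total, public_total = 21, 15

# ASSUMPTION: counts inferred from the reported agreement percentages.
private_agree = 14  # 14 / 21 is about 66.66%
public_agree = 13   # 13 / 15 is about 86.66%

private_pct = 100.0 * private_agree / private_total
public_pct = 100.0 * public_agree / public_total
overall_pct = 100.0 * (private_agree + public_agree) / (private_total + public_total)

print(round(private_pct, 2), round(public_pct, 2), overall_pct)
```

Under these inferred counts, 27 of the 36 previously classified definitions keep their category, which matches the reported 75% agreement.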

E. When Applying our Algorithm to our Definitions, it has been found that
Our definitions cover the largest percentage of the DT elements: in general (all definitions) [57] with (42.5%), in the public definitions [58] with (38.5%), and in the private definitions [59] with (38.7%). Overall, our definitions achieved the highest percentages in both TF-IDF and Fp-tree, which indicates that they are more comprehensive.

VIII. CONCLUSION
Although digital transformation is currently a hot topic, there is no generally accepted definition, which has implications for both researchers and practitioners. Consequently, the goal of this study was to learn more about the concept of digital transformation. Based on the analysis of previous definitions of digital transformation, we divided them into two groups, private sector and public sector, and created a new group called general. We propose a comprehensive approach to defining the major elements of DT definitions in each category as well as in general (all definitions). This approach consists of five phases. The first phase collects and classifies DT definitions. The second phase handles synonyms and defines the words that must appear together. The third phase extracts the major DT elements in each category using text mining methods (Fp-trees, TF-IDF). The fourth phase discusses and compares the DT elements. The fifth phase proposes new definitions of DT for the private, public, and general categories. Finally, we propose an assessment tool (algorithm) that identifies the percentage of covered elements for each definition in each category and the percentage of missing elements. The results of applying TF-IDF showed that digitaltechnology, digital, businessmodel, businessprocess, and change are common elements across all DT definition categories. This suggests that these are the minimum elements needed to define DT in either the private or the public sector. Of the definitions previously classified as private, our algorithm classified 66.66% as private, 28.57% as public, and 4.76% as general. Of the definitions previously classified as public, our algorithm classified 86.66% as public and 13.33% as private. The assessment tool agreed with 75% of the previous classifications of definitions and disagreed with 25% of them.
We also used the assessment tool to identify the categories of the definitions [1]-[20] that were not previously classified. The assessment tool classified 80% of them as private definitions, 15% as public definitions, and 5% as general. Overall, when the assessment tool was used to assign a category to all 56 definitions, most definitions were classified as private (57.14%), followed by the public category (39.28%) and the general category (3.57%). This indicates that most definitions of digital transformation focus more on the private sector than on the others. It can also be noted that our proposed DT definitions covered the largest percentage of the DT elements: 42.15% in general (all definitions), 38.7% in private, and 38.35% in public. This shows that our suggested definitions are more thorough. This study was limited by the small number of definitions examined (56), a shortcoming that will be addressed in future work. We also plan to run further experiments with other text mining algorithms and to apply our approach to other domains.