A Framework for Weak Signal Detection in Competitive Intelligence using Semantic Clustering Algorithms

Abstract: Companies now share large amounts of structured and unstructured data on the web. This data holds many signals that can be analyzed to detect innovation using weak signal detection approaches. To gain an advantage over competitors, the velocity and volume of data available online must be exploited to extract and monitor any type of strategic challenge or surprise, whether in the form of opportunities or threats. To capture early signs of change in the environment in a big data context, where data is voluminous and unstructured, we present a framework for weak signal detection that relies on crawling a variety of web sources and on a big data implementation of text mining techniques, and that detects and monitors weak signals automatically through an aggregation of semantic clustering algorithms. The novelty of this paper lies in the framework's ability to scale to very large amounts of unstructured data, which require novel analysis approaches, and in the aggregation of semantic clustering algorithms for better automation and higher accuracy of weak signal detection. A corpus of scientific articles and patents is collected to validate the framework and to provide a use case for researchers interested in identifying weak signals in a corpus from a specific technological domain.


I. INTRODUCTION
In the era of big data, information flows from many sources and in huge volumes. Companies and organizations face threats from many opponents and competitors, and strategic decisions must be made in order to survive market changes and the cultural, technological, or political shifts that may occur in their environment. Economists rely on well-established strategy models to conduct a thorough competitive intelligence activity [1] [2]. For example, the main purpose of SWOT analysis is to analyze threats and opportunities and to develop plans to react strategically to those events; this model can be supported by weak signal detection and early warning techniques [3]. The PEST model, in turn, analyzes data about the environment of the company by monitoring political, economic, social, and technological factors in order to prepare strategic responses to any change, so that the company can maintain a dominant position in the market. Many organizations invest heavily in developing systems that automate the process of competitive intelligence [4] [5] and implement their adopted strategies. One of the main goals of those systems is the detection of weak signals in the environment surrounding an organization. Environmental scanning is gaining the attention of many stakeholders because of the benefits [6] it brings to the well-being and sustainability of their companies. The aim of most environmental analysts is to detect pieces of valuable information that give them the strategic advantage of anticipation and early response planning, which can be done through weak signal detection. A weak signal is defined as a temporal change that occurs in a domain, a topic, or the environment in general [7], and that may have an impact on the future and become a trend. Therefore, the early detection and identification of this strategic information is crucial to the evolution of an organization.
Many definitions are given to this concept, and different techniques and approaches are applied to detect this kind of information automatically, which is the subject of the next sections.
Companies must be able to understand and explore their environment to extract implicit knowledge that experts cannot easily identify. They should also be able to predict the future evolution of a specific domain. The emergence of web data and the availability of information online push companies to exploit this data to extract meaningful strategic information, which allows them to make optimal decisions based on a scientifically sound analysis of the data and on an intelligent web mining approach [8] that extracts high-quality data.
Competitive intelligence systems are software platforms that group together a set of tools and technologies that companies implement in order to keep track of their evolving environment [9]. Many of these solutions neglect the anticipative information model, which helps predict and monitor trends that reveal threats and opportunities to be harnessed for competitive advantage.
Weak signals are pieces of information that help companies identify threats and opportunities in their environment, which in turn allows the implementation of an anticipative strategy rather than a reactive one that responds to events as they happen. Companies must use the latest big data technologies and advanced algorithms in order to process and analyze this data efficiently and identify weak signals [10]. In this paper we propose a framework for weak signal detection in data collected from the web, using big data technologies and an aggregation of semantic clustering algorithms based on Apache Spark to detect weak signals and emerging trends and to monitor opportunities and threats. This paper is structured as follows: in section 1, we define the main concepts of this work: competitive intelligence, the SWOT analysis strategy, weak signal detection, competitive intelligence systems, big data analytics, and semantic clustering algorithms. Section 2 presents related works that propose novel tools and methods for weak signal detection and highlights some of their limitations. Section 3 presents the proposed framework and our approach to detecting weak signals. Section 4 presents the results of a case study on collected articles about "big data", followed by a discussion and conclusion.
www.ijacsa.thesai.org

II. PROBLEM STATEMENT
In order to monitor competitors and identify early warning signs that help decision makers identify companies' key intelligence needs [11], we need a framework for weak signal detection that allows us to listen to and anticipate changes in the market [12] by analyzing data in an unsupervised manner and capturing potential weak signals that evolve over time.
We define the problem and the importance of our contribution as follows: the main problem is how to automatically process and analyze large amounts of unstructured data from various sources in order to detect weak signals and unveil strategic information hidden in a large corpus of textual documents.
We use semantic clustering algorithms with an aggregation approach to automate the detection of weak signals that share the characteristics defined in the framework, and we propose them to the end user, a domain expert, who then judges their usefulness for a strategic decision or action.
Most existing solutions do not handle a variety of sources or big data. We therefore propose a framework capable of analyzing data coming from multiple sources, with an architecture that supports growth in the volume and velocity of data. It relies on the capabilities of Apache Spark and on semantic clustering algorithms such as LDA (Latent Dirichlet Allocation), LSA (Latent Semantic Analysis), and K-Means [13] to produce accurate results and clusters of highly semantically related terms that may represent a weak signal.

A. Competitive Intelligence
Competitive intelligence is defined as a process, activity, or service [14] that starts from the definition of a strategic need or problem and continues with the collection of data from different sources. Analysts then process this data with their set of tools and techniques to extract strategic information, interpret it, and transform it into usable and valuable knowledge that is disseminated to stakeholders in order to enhance the decision-making process. Every organization has a different model and strategy for conducting competitive intelligence, which varies depending on the size, the environment, or the needs of the organization.
The goal of conducting competitive intelligence is to define the position of an organization in the market and to help it be aware of the changes and competitive forces around its environment [15], by providing an organizational tool capable of generating valuable knowledge from raw data to guarantee better business performance by taking strategic actions at the right time [16].

B. SWOT Analysis and PEST Model
Many models exist to implement competitive intelligence monitoring strategies. Economists have proposed models for establishing environmental scanning tools that prepare for any change in the market and give an objective perspective on the position of an organization. SWOT analysis [17] focuses on analyzing the strengths and weaknesses of a company by processing internal data, and the opportunities and threats coming from external data. When talking about weak signals, we are more interested in analyzing the opportunities and threats coming from the market. The PEST model [18] stands for the political, economic, social, and technological factors of an environment that could influence the existence of an organization and its evolution in the market. This external information, which is present in both structured and unstructured form, can be collected from web data and exploited in technological intelligence to detect innovation [19]. The aim of this paper is to analyze big unstructured data using big data analytics technologies and efficient algorithms while following the main ideas and concepts of those two models, as in Fig. 1.

C. Weak Signals
According to Igor Ansoff [20], weak signals are defined as small changes and imprecise early indications that occur over a period of time on a specific topic that may have an ongoing impact on the future. Weak signals are temporal changes that hold important and strategic information that companies and organizations must detect and collect to stay ahead in the market [21]. This helps them implement an anticipative approach of handling the opportunities and threats present in the market in the form of unstructured data harvested from the web.
The identification of weak signals relies on some characteristics and key points. According to Ansoff, weak signals are weakly mentioned at their first appearance and are less frequent than the main concepts in the context where they exist, but they are novel and hold a sign of innovation or a surprise in the market. The interpretation of weak signals requires domain experts, who contextualize the findings, transform data into knowledge, classify the signals as opportunities or threats, and disseminate them to stakeholders so that an appropriate strategic decision can be made.

D. Apache Spark
Due to the volume of data available online, data must be collected from different sources in different formats. A homogenization step is mandatory to unify the structure of the data to be collected. Once the data is collected, we end up with huge volumes of data that cannot be processed by a normal computing approach, thus the need for big data analytics technologies that support huge volumes and fast streams of data. Few weak signal detection researchers have proposed a technological framework that addresses the issue of big data. Therefore, we propose in this paper a big data framework for weak signal detection with the implementation of semantic clustering algorithms in Apache Spark.
Apache Spark [22] is one of the main big data analytics technologies and one of the best-known platforms for massive distributed computing. The framework is gaining a lot of attention in the big data community, and its use in a variety of applications [23] has proved efficient when dealing with large datasets. Hence we chose it in our attempt to develop a competitive intelligence system [24] that analyzes and extracts strategic information from the increasing amounts of data available in the environment of companies and organizations.
Apache Spark has been used in many applications [25] and to implement a variety of big data platforms and solutions. Spark originated in 2009 and is commonly used alongside the Hadoop ecosystem. While Hadoop processing is based on the MapReduce computing paradigm, Spark builds a DAG (Directed Acyclic Graph) of the transformations to apply to RDDs (Resilient Distributed Datasets), collections of data partitioned across the nodes of the cluster. Because RDDs can be kept in memory and reused, Spark avoids the costly data copies between steps that iterative algorithms, such as those used in our weak signal detection framework, would otherwise incur.

IV. RELATED WORK
Many researchers have applied various methods to detect weak signals efficiently in large volumes of documents [26]. Those methods range from supervised to unsupervised machine learning, from automatic and semi-automatic methods to manual methods relying on expert input, and from quantitative to qualitative methods; many data sources have been used to validate the approaches and detect weak signals.
In one of the early attempts to discover weak signals in data, Yoon [27] used a keyword-based text mining method to identify opportunities in web news data. He used a quantitative method in which he performed a time-weighted analysis by calculating the occurrence and frequency of keywords over a period of time. However, this attempt was limited to a single data source, lacked automatic crawling of data from multiple sources, and does not scale to large datasets. The results may also be hard to interpret when visualizing a large space of keywords.
El Hadadai Anass et al. [28] proposed a method based on sequence data mining for extracting emerging trends and highlighting the evolution of domains by crossing terms with dates and other fields. Using correspondence analysis, multiple correspondence analysis, and a visualization tool, this approach extracts clusters of weak signals by sorting the obtained matrix and extracting clusters from it. The method was evaluated on a dataset of scientific articles and patents collected from scientific databases in order to identify technological weak signals. However, it cannot easily be extended to large datasets, and it needs an expert to operate the tool and perform the analysis.
D. Thorleuchter et al. [29] proposed a methodology based on idea mining and Latent Semantic Analysis to identify weak signals. A matrix is constructed from the vectors and patterns discovered by the idea mining approach, and dimensionality reduction is applied to it using singular value decomposition (SVD), which produces a set of semantically related clusters that may be weak signals. As the authors state, the method is limited: it could not discover implicitly cited weak signals, and they proposed an enhancement using Latent Dirichlet Allocation to obtain more accurate results.
Antonio L. et al. [30] proposed a method for conducting anticipative intelligence by analyzing text and identifying weak signals, using k-medoids clustering and the Jaccard function as a similarity measure between the obtained clusters in order to analyze similar clusters of weak signals. The method is claimed to be automatic, but the dataset is collected from experts at the beginning of the process.
Julien Maitre et al. [31] present the work most closely related to ours, and our proposal is inspired by it. They present a novel approach for weak signal detection in weakly structured or unstructured data that combines Latent Dirichlet Allocation and the Word2Vec algorithm to cluster a corpus of documents collected from the web. The article also proposes a method for identifying the number of clusters k to be extracted from a corpus using LDA. This number is hard to define in most cases and is crucial to the quality and robustness of the results, especially for weak signals, where a small k may prevent the identification of important weak signals.
In our approach, we group the three algorithms (LDA, LSA, and K-means) in order to reduce the mistakes and weaknesses of those approaches by using a clustering aggregation [32] approach. It is supported by the computational power of Apache Spark and by the flexible nature of RDDs and their reusability in iterative algorithms, and it uses the ML pipeline feature of Apache Spark to facilitate and automate the process of weak signal identification with minimal interaction from experts.

V. PROPOSED FRAMEWORK
In light of the findings of the literature review conducted by C. Muhlroth et al. [26] and of other reviewed approaches [33] [34] [35] [36], we found a need for a big data analytics framework for automatic weak signal detection. We therefore propose a framework that uses Apache Spark to implement the data analysis, from data collection to weak signal identification, using semantic clustering algorithms. The feature of Apache Spark that allows us to achieve this is the ML pipeline, which automates the steps to be applied to a dataset in order to extract implicit, hidden information that may provide key strategic indications. Fig. 2 outlines the steps followed in the pipeline implemented using Apache Spark, starting from data collection and ending with the identification of weak signals contained in the corpus of collected documents. In the following section, we briefly explain the Apache Spark ML pipeline and then describe the steps of the pipeline in detail.

A. Apache Spark DAG and ML Pipeline
Apache Spark provides an API for manipulating RDDs (resilient distributed datasets), a structure well suited to big unstructured data. The strength of this data structure lies in its ability to scale to huge volumes of data, hence its adoption in our work. RDDs hold the corpus data, which is analyzed using the ML pipeline API; a pipeline represents the sequence of processes to perform on a dataset to get the desired results. This makes it easier to aggregate multiple algorithms into a single pipeline. We use this technique to implement an efficient big data analytics framework for weak signal detection by combining the semantic clustering algorithms presented in Fig. 2.
The mechanism that allows Apache Spark to execute such processing is the DAG, an enhanced strategy for performing map-reduce style tasks, as shown in Fig. 3: execution is planned in stages and steps that form a directed acyclic graph of transformations to apply to the dataset.
All the algorithms used in this framework are implemented using the Apache Spark MLlib library, which contains a variety of machine learning and clustering tools that can be applied to the data. We combine LDA, LSA (also called LSI), and K-means in an ML Pipeline to perform semantic clustering on the corpus of collected data. At the end, we communicate the findings to stakeholders and experts, who identify the clusters that hold potential weak signals.

B. Data Collection
The framework starts with data collection. We collect data from multiple databases of scientific articles and patents and store it in the Hadoop file system. When the execution of the ML pipeline starts, we load the data from Hadoop onto the Apache Spark cluster in order to execute the pipeline depicted in Fig. 2. Scientific databases such as IEEE Xplore and the ACM Digital Library, and the patents from USPTO, contain many articles and documents with many fields, such as text, abstract, and publication date. We are interested in the text and publication date of each document in order to conduct a temporal analysis of weak signals and of the evolution of a topic over time, and thereby perform technological surveillance of a specific field of interest.
Several scrapers and crawlers were developed to collect data from those websites using web mining methods and the Scrapy Python framework [37], which makes it possible to create scraping agents that crawl as many web pages as possible while eliminating repeated documents. Our approach helps decision-makers and analysts collect data automatically and conduct environmental scanning without manual intervention, which would otherwise be a hard task for companies in this era of big data.

C. Data Preprocessing
Preprocessing is an important step that cleans the data and formats it to our needs. After choosing the text field to analyze and the date field that helps us filter emerging trends, we clean the text of ambiguous characters, remove stop words, and apply stemming and lemmatization, which helps us obtain more accurate results and more easily interpretable information from raw data. We then create n-grams from the corpus and add them to its vocabulary. This step enables the clustering algorithms to identify multi-word terms that may form an important part of a weak signal, especially in the scientific field. Because weak signals are characterized by low word frequency, we eliminate terms whose count is above a threshold, for example 200 occurrences: we are not interested in highly frequent words, which in most cases represent strong signals or trends and are not the purpose of our analysis.
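The preprocessing steps described above can be illustrated with a minimal sketch in plain Python rather than Spark. The stop-word list, the helper names (tokenize, add_bigrams, filter_frequent), and the toy corpus are assumptions made for illustration; stemming and lemmatization are omitted to keep the example self-contained.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "for", "on"}

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def add_bigrams(tokens):
    """Append adjacent-token bigrams (joined with '_') to the unigrams."""
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

def filter_frequent(docs_tokens, max_count=200):
    """Drop every term whose corpus-wide count exceeds max_count."""
    counts = Counter(t for doc in docs_tokens for t in doc)
    return [[t for t in doc if counts[t] <= max_count] for doc in docs_tokens]

docs = [
    "Big data analytics in the cloud",
    "Big data for health monitoring",
    "Edge analytics and big data",
]
tokenized = [add_bigrams(tokenize(d)) for d in docs]
# A threshold of 2 is used only because this toy corpus is tiny.
filtered = filter_frequent(tokenized, max_count=2)
print(filtered)
```

With this toy threshold, the dominant terms "big", "data", and "big_data" are dropped, while rarer multi-word terms such as "health_monitoring" survive as weak signal candidates.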

D. Data Exploration
The number of topics to extract cannot be determined in advance, because the algorithms used are unsupervised and the analyst does not know how many clusters will be obtained. Therefore, we choose a rule of thumb and define the number of clusters to extract as in Eq. (1), after the extraction of the vocabulary from the corpus, where n is the number of words in the vocabulary of the corpus.
Determining an approximate k is an important step in this pipeline. We can specify k using data exploration techniques or methods from the literature [38], which are outside the scope of our research, or we can try different values of k and analyze the clusters obtained. A small k must be avoided, however, so as not to eliminate important potential weak signals that are not heavily cited.

E. LDA
Latent Dirichlet Allocation [39] is a generative probabilistic model used for text classification and topic modeling. It represents documents as mixtures of topics, with the objective of assigning each term to a semantically related topic. When LDA (Fig. 4) is applied to a corpus of documents, the algorithm clusters the topics and their related terms according to their semantic relationships. It identifies k topics, where k is a number specified by the analyst; many methods exist for choosing the k that gives the most accurate clusters.
The LDA generative process is defined as follows, for each document w in a corpus D:
1. Choose N ∼ Poisson(ξ).
2. Choose θ ∼ Dir(α).
3. For each of the N words w_n:
(a) Choose a topic z_n ∼ Multinomial(θ).
(b) Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.
Those steps are the standard LDA model, which assigns a distribution of semantically related words to a set of specific topics; in our case those topics may represent innovations, opportunities, or threats.
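The generative steps above can be simulated with a short sketch using only the Python standard library. It samples one synthetic document and does not perform inference (the framework relies on Spark MLlib for that). The vocabulary, the topic-word matrix beta, and the values of alpha and xi are hypothetical.

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's method for a Poisson draw; adequate for small lam."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def sample_dirichlet(alpha, rng):
    """Dirichlet draw obtained by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(alpha, beta, vocab, xi, rng):
    """Generate one document following the LDA generative process."""
    n_words = sample_poisson(xi, rng)        # N ~ Poisson(xi)
    theta = sample_dirichlet(alpha, rng)     # theta ~ Dir(alpha)
    words = []
    for _ in range(n_words):
        # z_n ~ Multinomial(theta)
        z = rng.choices(range(len(alpha)), weights=theta)[0]
        # w_n ~ p(w_n | z_n, beta)
        words.append(rng.choices(vocab, weights=beta[z])[0])
    return theta, words

# Hypothetical two-topic setup: a "health" topic and an "infrastructure" topic.
vocab = ["sensor", "patient", "cloud", "cluster"]
beta = [[0.5, 0.5, 0.0, 0.0],
        [0.0, 0.0, 0.5, 0.5]]
rng = random.Random(42)
theta, words = generate_document([1.0, 1.0], beta, vocab, xi=8, rng=rng)
print(theta, words)
```

Fitting LDA amounts to inverting this process: given only the observed words, estimate the topic mixture of each document and the word distribution of each topic.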
We therefore use this algorithm to detect underlying topics in a corpus of documents. Those topics may include weak signals that are not easily identified and lie outside the knowledge of experts, especially in the case of new innovations in a domain. After removing the most frequent terms from the documents, we aim to identify sets of words that are less frequent, semantically related, and belong to the same topic. That is why a sufficiently large k is essential for extracting latent topics that represent a small proportion of a document, which is the nature of weak signals as defined by Ansoff.

F. LSA
Latent semantic analysis [40] is a text mining technique that creates a semantic space in order to identify relationships between words in a corpus of documents. Those relationships are detected using a linear algebra technique called singular value decomposition (SVD). Its goal is to project a document-term matrix created from the corpus into a lower-dimensional space in order to detect close words and extract coherent topics and similarities between documents. By applying this technique to weak signal detection, we want to detect weak clusters that are appearing in the corpus. Those highly coherent and newly emerging clusters may hold important strategic information: they may represent an opportunity for investment and collaboration, or a threat to which a company has to plan a strategic response.
After crossing the terms with their corresponding documents, we obtain an m×n matrix A, where m is the number of terms and n is the number of documents. We then apply SVD, which decomposes the matrix into three new matrices, as depicted in Fig. 5:
$$A \approx U \Sigma V^{*} \qquad (2)$$
where U is an m×k matrix that holds the assignment of words to topics, Σ is a k×k matrix that contains the singular values representing the importance of each topic, and V* is a k×n matrix that contains the topic distribution across documents. In our case, we are interested in the first two matrices. By crossing the pairs of vectors of the two matrices, we obtain the clusters of topics and their corresponding terms. The clusters obtained in this step are merged with the previous results to enhance the semantic understanding of the corpus.
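To make the role of the decomposition concrete, the following sketch extracts only the dominant latent topic of a tiny term-document matrix. It swaps in plain power iteration for the full SVD used in the framework, and the matrix, the term list, and the function names are hypothetical.

```python
import math

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vec)) for row in matrix]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def top_singular_triplet(a, iterations=200):
    """Return (u, sigma, v) for the largest singular value of a."""
    at = transpose(a)
    u = [1.0] * len(a)
    for _ in range(iterations):
        w = matvec(a, matvec(at, u))   # one power-iteration step on A A^T
        scale = norm(w)
        u = [x / scale for x in w]
    v_unnormalized = matvec(at, u)
    sigma = norm(v_unnormalized)
    v = [x / sigma for x in v_unnormalized]
    return u, sigma, v

# Hypothetical term-document counts: rows are terms, columns are documents.
terms = ["health", "patient", "spark", "cluster"]
a = [[2, 1, 0],
     [1, 2, 0],
     [0, 0, 1],
     [0, 0, 1]]
u, sigma, v = top_singular_triplet(a)
ranked = sorted(range(len(terms)), key=lambda i: abs(u[i]), reverse=True)
top_terms = [terms[i] for i in ranked[:2]]
print(round(sigma, 4), top_terms)
```

The vector u plays the role of one column of U (term weights for the topic), sigma is the corresponding diagonal entry of Σ, and v is one row of V* (the weight of the topic in each document).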

G. K-MEANS
K-means is one of the most important algorithms for clustering data. The power of this approach resides in its ability to perform unsupervised clustering of data with no prior knowledge, hence the choice of this algorithm to enhance the results of our approach and the quality of the obtained clusters.
As with the previous algorithms, we apply the preprocessing described in the previous section before running k-means. To find semantically related terms, relying on the Word2Vec [41] model, and group them into clusters, we follow these steps, as depicted in Fig. 6:
• Cleaning and preprocessing of the text.
• Determination of the number k.
• Feature extraction using Word2Vec to represent each word semantically as a vector.
• Applying k-means.
• Getting the clusters.
In the next step, we will merge the most similar clusters to unify and expand the clusters that are candidates to be weak signal clusters, containing information about potential opportunities or threats that must be noticed.
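The k-means steps listed above can be sketched in plain Python as follows. The two-dimensional vectors are hypothetical stand-ins for Word2Vec embeddings (real embeddings have many more dimensions), and seeding the centroids with the first k vectors is a simplification chosen to keep the example deterministic.

```python
import math

def closest(point, centroids):
    """Index of the centroid nearest to point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(coords) / len(points) for coords in zip(*points)]

def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's algorithm; the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    assignment = []
    for _ in range(iterations):
        assignment = [closest(v, centroids) for v in vectors]
        for j in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == j]
            if members:   # keep the old centroid if a cluster empties
                centroids[j] = mean(members)
    return assignment

# Hypothetical embeddings standing in for Word2Vec output.
embeddings = {
    "patient":   (0.90, 0.10),
    "spark":     (0.10, 0.90),
    "hospital":  (0.80, 0.20),
    "hadoop":    (0.20, 0.80),
    "diagnosis": (0.85, 0.15),
    "mapreduce": (0.15, 0.85),
}
words = list(embeddings)
labels = kmeans([embeddings[w] for w in words], k=2)
clusters = {}
for word, label in zip(words, labels):
    clusters.setdefault(label, []).append(word)
print(clusters)
```

Each resulting group of words is one candidate cluster that is passed on to the aggregation step.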

H. Cluster Aggregation
Cluster aggregation [32] applies different clustering algorithms to the dataset and finds a consensus on the best cluster groups. This eliminates duplicates and the noise that each algorithm would introduce if applied individually, which improves the quality and robustness of the clustering.
After extracting the clusters in the previous steps, denoted C1, C2, and C3 for LDA, LSA, and K-means respectively, we move to the merge step. It computes a similarity between all pairs of clusters in order to identify similar clusters and merge them, which eliminates redundancy and enhances the quality of weak signal detection by minimizing the disagreements between clusters according to Equation 4.
where v is a set of words or multi-terms and m is the number of all clusters from the applied algorithms.
We use the Approximate Similarity Join implementation of Apache Spark MLlib, which is based on the Jaccard similarity function, Eq. (4). We calculate the similarity for each pair of clusters across all algorithms, and if it passes a threshold, we merge the two clusters into one, in order to obtain the clustering that minimizes the number of disagreements.
The resulting clusters from Algorithm 1 are shared with experts and stakeholders, who identify potential weak signals in the corpus.
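The merge step can be illustrated with an exact, in-memory sketch in plain Python. It is not the Spark approximate similarity join and is not meant to reproduce Algorithm 1. It assumes the usual definition of Jaccard similarity, |A ∩ B| / |A ∪ B|, a greedy merge order, and a hypothetical threshold of 0.5; the example clusters are invented.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def aggregate(clusters, threshold=0.5):
    """Greedily merge clusters whose Jaccard similarity reaches threshold."""
    merged = [set(c) for c in clusters]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if jaccard(merged[i], merged[j]) >= threshold:
                    merged[i] |= merged.pop(j)   # fold cluster j into cluster i
                    changed = True
                    break
            if changed:
                break   # restart the scan after every merge
    return merged

# Hypothetical outputs of the three algorithms (C1: LDA, C2: LSA, C3: K-means).
c1 = [{"health", "patient", "sensor"}, {"spark", "hadoop", "cluster"}]
c2 = [{"health", "patient", "diagnosis"}]
c3 = [{"spark", "hadoop", "mapreduce"}, {"privacy", "security"}]
result = aggregate(c1 + c2 + c3)
print(result)
```

Here the five input clusters collapse into three: the two health clusters and the two infrastructure clusters are merged, while the privacy cluster, found by only one algorithm, is kept as is.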

I. Weak Signal Identification
Extracted clusters go through a last step, which calculates a score representing the weighted term evolution, inspired by Yoon et al. [27]. The evolution rate "er*" of each term of the cluster "Ci" during a period t is calculated, and the sum "eri" over all terms represents the score of the cluster. Based on that score, we can identify the clusters that may hold weak signals, represented by semantically related terms from the corpus. We order the clusters by their score and, based on that score and on the interpretation of a domain expert, we can spot the clusters that hold information about a weak signal, which, by interpretation, may be a threat, an investment opportunity, or an innovation that needs further investigation or collaboration.
In the next section we present the results of applying this flow to the collected corpus, discuss the obtained results and the advantages and limitations of our framework, and conclude with ideas for future research.
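The scoring step can be sketched as follows. The exact evolution-rate formula is not reproduced in the text, so this example assumes a simple relative growth between the first and last year of the period; the function names, the yearly counts, and the clusters are hypothetical.

```python
def evolution_rate(yearly_counts, start, end):
    """Relative growth of a term's count between two years (assumed formula)."""
    first = yearly_counts.get(start, 0)
    last = yearly_counts.get(end, 0)
    return (last - first) / max(first, 1)   # max(..., 1) avoids division by zero

def cluster_score(cluster, term_counts, start, end):
    """Sum of the evolution rates of all terms in the cluster."""
    return sum(evolution_rate(term_counts.get(t, {}), start, end) for t in cluster)

# Hypothetical yearly occurrence counts per term.
term_counts = {
    "telemedicine": {2018: 2, 2020: 8},
    "wearable":     {2018: 1, 2020: 4},
    "hadoop":       {2018: 50, 2020: 40},
}
clusters = {
    "health": ["telemedicine", "wearable"],
    "infrastructure": ["hadoop"],
}
scores = {name: cluster_score(terms, term_counts, 2018, 2020)
          for name, terms in clusters.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Under this assumed formula, a cluster of rare but fast-growing terms ranks above a cluster of frequent but declining terms, which is the ordering presented to the domain expert.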

VI. RESULTS AND DISCUSSION
To evaluate the proposed method, we conduct an analysis on a dataset of scientific articles on the topic of "big data". We collected a corpus of 5800 documents and scientific articles about "Big Data" containing multiple fields, of which we are interested in the abstract and the publication date. We perform text mining on the text field and growth analysis using the publication date.
The purpose of our analysis is to perform the clustering aggregation of three algorithms, K-means, LDA, and LSA in order to combine the results of each algorithm and select from the obtained clusters the ones that are potential weak signals and may hold information about opportunities or threats.

A. Data Collection
We collect data from IEEE Xplore, ACM Digital Library, SpringerLink, and ScienceDirect to provide a case study and illustrate the processes of our approach. We use the search query "big data" and a publication date range from 2000 to 2020, and collect the documents and articles published in this range. We are interested in the abstract, title, and publication year of each document, as in Fig. 7. The scraper agents extract those fields using the CSS selectors of each database website in order to ease the homogenization of those fields in the next step.
We create a data frame from the documents containing the three fields we are interested in, and process them in the remaining steps of the framework. Fig. 8 shows an extract of the collected data.

B. LDA Obtained Clusters
After the preprocessing and cleaning step of the articles obtained about big data, we perform the first clustering algorithm, LDA, to obtain k clusters from the corpus. The topics obtained are semantically related and clustered in one group.
A sample of clusters obtained from the corpus is presented (Table I):

C. LSA Obtained Clusters
We then apply latent semantic analysis. LSA yields a different set of k clusters using singular value decomposition (SVD). For each cluster, we select a set of terms that represent the concept and are closely related to it, using the singular values in the Σ matrix.
By applying the LSA on our corpus of data we obtain the following clusters (Table II):

D. K-Means Obtained Clusters
Applying k-means yields a set of k clusters after Word2Vec is computed on the text to create features of semantically related words. These features serve as the measure of similarity between words or terms for semantic clustering. The clusters obtained from applying this algorithm are shown in Table III. In the next step, we merge similar clusters into one and build a cluster group that combines the strengths of all the algorithms and addresses the weaknesses of the individual approaches.

E. Aggregation Algorithm Obtained Clusters
By applying the approximate similarity join, we obtain pairs of similar clusters. Merging these pairs yields p clusters, with p < k*3, which gives a view of the overall clustering and corrects mistakes that a single algorithm could have made. The resulting clusters represent all the small topics and semantically related terms that may hold an opportunity or a threat (Table IV). We merge similar clusters in order to eliminate redundant clusters and improve the quality of visualization.
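The merging step can be sketched as follows. This sketch makes several assumptions: it uses an exact pairwise Jaccard similarity in place of the approximate similarity join, the threshold of 0.5 is arbitrary, merging is transitive, and the function names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two collections of terms."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_similar(clusters, threshold=0.5):
    """Merge clusters whose term sets have Jaccard similarity >= threshold."""
    parent = list(range(len(clusters)))  # union-find forest over cluster indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps the trees shallow
            i = parent[i]
        return i

    # Link every pair of sufficiently similar clusters.
    for i, j in combinations(range(len(clusters)), 2):
        if jaccard(clusters[i], clusters[j]) >= threshold:
            parent[find(j)] = find(i)

    # Take the union of the terms of all clusters sharing a root.
    merged = {}
    for i, terms in enumerate(clusters):
        merged.setdefault(find(i), set()).update(terms)
    return [sorted(terms) for terms in merged.values()]


# Toy clusters (assumption), standing in for the outputs of the three algorithms.
lda = [{"health", "patient", "clinical"}, {"hadoop", "spark", "cluster"}]
lsa = [{"health", "patient", "medical"}, {"privacy", "security"}]
kmeans_out = [{"hadoop", "spark", "mapreduce"}]
merged = merge_similar(lda + lsa + kmeans_out)
```

In this toy example five input clusters are reduced to three, which illustrates why the number of merged clusters p stays below the total number of clusters produced by the three algorithms.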
To visualize the results, we create a graph from the term-topic adjacency matrix and plot it using Gephi to see the clusters and the relationships between them. The resulting graph contains 670 nodes and 12 006 edges. We show an extract of the graph in Fig. 9, and an identified weak signal in Fig. 10 containing semantically related words about the application of big data in health.
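One way to turn term-topic memberships into a graph that Gephi can load is sketched below. The weighting rule (number of clusters shared by two terms), the edge-table layout with Source, Target, and Weight columns, and the function names are assumptions made for illustration; the toy topics are not taken from our corpus.

```python
import csv
import io
from collections import Counter
from itertools import combinations


def term_edges(topic_terms):
    """Weight each term pair by the number of clusters containing both terms."""
    weights = Counter()
    for terms in topic_terms.values():
        # Sorting gives every undirected pair one canonical orientation.
        for a, b in combinations(sorted(set(terms)), 2):
            weights[(a, b)] += 1
    return weights


def to_edge_csv(weights):
    """Serialize the weighted edges as a CSV edge table."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (a, b), weight in sorted(weights.items()):
        writer.writerow([a, b, weight])
    return buffer.getvalue()


# Toy term-topic memberships (assumption).
toy_topics = {1: ["health", "patient", "big data"], 2: ["patient", "health"]}
edge_table = to_edge_csv(term_edges(toy_topics))
```

The resulting text can be saved to a file and imported as an edge table, after which the layout and the cluster coloring are done in the visualization tool.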

F. Interpretation and Discussion
By interpreting the results, we can spot weak signals and hidden information that are not visible to experts. By combining their expertise with the results obtained, we can identify clusters that potentially hold strategic information. These clusters should then be traced back to the original documents for further analysis, to understand the context in which the information appears and to assess its importance.
In our approach, we filtered weakly cited words for a specific period (year of publication) from the corpus and applied three semantic clustering algorithms, aiming to find the most accurate clusters through an aggregation method. The obtained clusters may contain pieces of information that are crucial to implementing an anticipative strategy for an organization. A weak signal is characterized by the evolution of its presence, or its number of occurrences, through time, which may make it a strong signal in the future, although not all weak signals are destined to become strong.
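The filtering of weakly cited words per publication year, and the tracking of a term's presence through time, can be sketched as follows. The whitespace tokenization, the document-frequency criterion, the max_count threshold, and the function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def document_frequencies(docs):
    """Count, for each year, in how many documents each term appears."""
    per_year = defaultdict(Counter)
    for year, text in docs:
        # set() ensures a term is counted once per document.
        per_year[year].update(set(text.lower().split()))
    return per_year


def weak_terms_by_year(docs, max_count=1):
    """Keep, for each year, the terms appearing in at most max_count documents."""
    return {
        year: sorted(term for term, count in counts.items() if count <= max_count)
        for year, counts in document_frequencies(docs).items()
    }


def occurrences_over_time(docs, term):
    """Document frequency of one term per year, in chronological order."""
    frequencies = document_frequencies(docs)
    return {year: frequencies[year][term] for year in sorted(frequencies)}


# Toy corpus of (year, text) pairs (assumption).
toy_docs = [
    (2019, "big data health"),
    (2019, "big data storage"),
    (2019, "big data analytics"),
    (2020, "big data health"),
    (2020, "big data health privacy"),
]
weak = weak_terms_by_year(toy_docs)
trend = occurrences_over_time(toy_docs, "health")
```

In the toy corpus, "health" is weakly cited in 2019 and appears more often in 2020, which is the kind of evolution that characterizes a candidate weak signal.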
In Fig. 9, we present the graph representation of the filtered words from the corpus. These words are related by their co-occurrence in the same document and their membership in the same cluster. In Fig. 10, we single out a cluster so that we can study the semantics of this potential weak signal with the help of a domain expert.
We see in Fig. 10 that the semantic cluster of topic 32 is weakly cited and highly rated in the last period of research, which means that this low-visibility cluster may become a trend in the future. A caveat concerns the number of clusters k, which must be chosen carefully: we must either experiment with different values of k or use a dedicated algorithm to determine the value of k that gives the most accurate results.
Extracted potential weak signals must be harnessed to identify threats and opportunities in the market. Our method extracts the most promising clusters of weak signal topics. Using our approach together with expert intervention, we can spot the key information that will generate value for organizations. The advantage of this method, however, is not to predict which weak signal will become strong, but to enhance the quality of the clusters extracted from the corpus, so that we keep and analyze only the semantic clusters holding potential weak signals through the aggregation of three algorithms: LDA, LSA, and k-means. Predicting whether a weak signal will become strong requires labeled data; with the application of supervised machine learning [42], we can extract the features of weak signals that are candidates to become strong and trend in the future.

VII. CONCLUSION
In this paper, we proposed a novel framework that uses semantic clustering aggregation models. It is made possible by the computational power of Apache Spark, whose ML pipeline allows the process of weak signal detection to be automated over large volumes of data. The aim of aggregating clustering methods is to combine the features extracted by each method and compensate for their individual weaknesses. Such a tool will allow businesses and stakeholders to remain active and alert in the market. Semantic clustering methods have proven to be very efficient for topic modeling and for extracting varied, semantically related topics from a large corpus of documents. Hence, we consider this approach pragmatic, and it contributes to the domain of weak signal detection.
Our framework will help stakeholders identify and prepare scenarios of intelligence needs. For each potential weak signal, there must be a strategic response ready to address it. This helps stakeholders adopt an anticipative approach to strategic competitive intelligence in a big data context, where new data becomes available every millisecond and manual extraction and analysis of documents is impossible.
Despite the contribution of our framework, we believe there is more work to be done in the weak signal detection literature. For example, the data analyzed in weak signal detection research is unstructured; hence the need for more advanced clustering methods that perform unsupervised machine learning to label past data as weak signals, and for text mining deep learning models [43] that can extract and identify weak signals in future data once it is available online, which will give organizations a competitive advantage. In our future work, we will apply graph embedding techniques [44] [45] to reduce the dimensionality of the corpus and facilitate the semantic representation of weak signals. In particular, we will study dynamic graph embedding to monitor the evolution of a domain's terminology through time, in the hope of detecting an innovation, an opportunity, or a threat as early as possible.
In conclusion, we must mention that most of the literature offers limited methods and approaches for validating extracted weak signals [46]. As a future research direction, we propose adopting novel semantic clustering algorithms that rely on deep learning, such as Word2Vec and GloVe word embeddings [47], for a more precise semantic analysis of the corpus, and developing novel approaches that rely on labeled data.