Evaluation and Improvement of Procurement Process with Data Analytics

Analytics can be applied in procurement to benefit organizations beyond the prevention and detection of fraud. This study aims to demonstrate how advanced data mining techniques, such as text mining and cluster analysis, can be used to improve the visibility of procurement patterns and provide decision-makers with insight for developing sourcing strategies that are more efficient in terms of cost and effort. A case study of an organization's effort to improve its procurement process is presented in this paper. The findings suggest that opportunities exist for organizations to aggregate common goods and services among the purchases made under and across different prescribed procurement approaches. They also suggest that these opportunities are more prevalent in purchases made by individual project teams rather than across multiple project teams.


I. INTRODUCTION
Several procurement lapses in Singapore have brought procurement risk into the spotlight [1]. Today, stakeholders demand to know whether their money is being spent wisely and not lost to fraudulent expenditure. While steps have been taken to improve internal control by strengthening procurement guidelines and policies, another approach has gathered pace: integrating data analytics into the procurement process to help prevent and detect fraud. Although fraud prevention and detection have been the key objectives of applying analytics in procurement, this study shows that there are other benefits to be reaped. Analytics improve the visibility of procurement patterns and empower stakeholders with better insight for developing sourcing strategies that are more efficient in terms of cost and effort.
In Organization X (the organization involved in this study chose to remain anonymous for privacy reasons), the prescribed procurement approach under its procurement rules and principles is based on the Estimated Procurement Amount (EPA) of the intended procurement, as summarized in Table 1. Organization X combines purchases of common goods and services by establishing Term Contracts to yield better value for money through economies of scale. With an Established Term Contract (ETC), Organization X can procure directly from the appointed supplier(s) whenever the product or service is required during the contractual period.
The mission of Organization X is to provide effective and timely information and communications technology (ICT) support and solutions. In Organization X, procurement of goods and services is proposed by project teams focusing on various areas and projects with different objectives and timelines. Based on the EPA of the intended procurement, their procurement needs are carried out in separate transactions using the prescribed procurement approaches stated in Table 1. There could be opportunities to aggregate common goods and services across the various areas and projects to achieve higher economies of scale and reduce administrative effort. By consolidating frequent purchases into a contractual agreement, Organization X can exploit economies of scale to obtain favorable prices, and can reduce the transactional overhead of subsequent acquisitions of the same item by incurring it once upfront. An illustration of the current situation in Organization X and the desired outcome is shown in Fig. 1.
The main purpose of this study is to propose a model in which analytics can be applied to detect such opportunities and derive meaningful insights that lead to improvement in the current procurement process. This paper is organized as follows: Section II provides a literature review of efforts made in the application of analytics in procurement and the proposed data mining techniques. Section III presents the data and input variables used in the analysis. Section IV shares the design considerations, proposed model and methodology for the analysis. Section V presents the results and a discussion of the findings. Section VI concludes the paper, proposes recommendations based on the findings and suggests areas for future improvement.

II. LITERATURE REVIEW
Kemp [2] commented that analytics have been used at an advanced level for years to combat fraud in the private sector, especially in financial services. He advocated the following approaches, which are proven within the private sector:
- Rules-based detection: identifies potential instances of fraud based on behaviors already proven to be fraudulent.
- Anomaly detection: spots unknown or unexpected patterns by comparing like-for-like data within groups.
- Advanced analytics: applies the latest data, text and web mining technologies to identify fraudulent and errant behaviors that have not already been spotted by rules-based and anomaly detection approaches.
Byrne [3] suggested that much effort has gone into minimizing fraud, misconduct and other unethical behaviors in procurement, and that it is time for procurement to start adding value to organizational strategies and to move from what many perceive as a policing role to a value-added role. He added that procurement should be managed strategically, which requires analysis of past procurement spending to determine whether individual purchases can be combined to cut costs. The National Fraud Authority [5] highlighted a number of inefficiencies in public procurement, some of which were due to departmental autonomy over procurement. It recommended that government should leverage its purchasing power by seizing opportunities to procure as a single entity. The Organisation for Economic Co-operation and Development (OECD) [6] has also identified that savings are being sought through a variety of measures, including centralization of the procurement function and the aggregation of purchases in order to achieve economies of scale.
Chae and Olson [4] discussed the role of analytical capability for sourcing in Supply Chain Management. There is a strong application of analytical IT to support supplier selection within supply chains. Prescriptive analytics has been a key enabler of manufacturer's sourcing-related decision making. Predictive analytics techniques are increasingly available these days for intelligent material planning, inventory management, and supplier relationship management. For instance, advanced machine learning techniques such as artificial neural networks and support vector machines are promising tools to enable effective sourcing. Pattern recognition, when used with large sets of historical purchase orders and supplier delivery data, can reveal hidden facts and potential problems with processes and performances.
Kantardzic [7] proposed that market search, business-intelligence gathering, e-mail management, claim analysis, e-procurement and automated help desks are only a few of the possible applications where text mining can be successfully deployed. Miner et al. [9] discussed efforts on clustering in Natural Language Processing (NLP) and how the necessary information extraction from the meaning of the text can be performed.
To analyze demand before it can be aggregated, Chia and Chen [8] discussed how Business Intelligence (BI) methodologies using online analytical processing (OLAP) concepts, such as drilling, pivoting, dicing and aggregating, can be applied to the unstructured content found in procurement databases by performing ad-hoc queries. This allows users to derive transaction trends at any hierarchy and resolution.
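As a rough illustration of the OLAP-style operations described above, the sketch below uses pandas rather than the BI tooling discussed in [8]; the column names and figures are hypothetical, not drawn from Organization X's data.

```python
import pandas as pd

# Hypothetical procurement transactions (column names are illustrative)
df = pd.DataFrame({
    "year":     [2011, 2011, 2012, 2012],
    "team":     ["A", "B", "A", "A"],
    "category": ["hardware", "software", "hardware", "software"],
    "amount":   [1200.0, 800.0, 950.0, 400.0],
})

# "Pivoting": amount spent per category, broken down by year
pivot = df.pivot_table(index="category", columns="year",
                       values="amount", aggfunc="sum", fill_value=0)

# "Drilling down": the same spend at a finer resolution (category x team)
drill = df.groupby(["category", "team"])["amount"].sum()
print(pivot)
print(drill)
```

The same two operations generalize to any hierarchy (supplier, buyer, account code) by changing the index and grouping columns.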
The literature review suggests the following:
a) There is no doubt about the benefits that analytics can bring to procurement, beyond its application in the prevention and detection of fraud.
b) The principles behind the analytical approaches applied successfully for fraud detection and prevention could be used to identify opportunities for the aggregation of purchases.
c) Advanced data mining techniques such as cluster analysis and text mining could be applied in procurement.
d) BI tools such as OLAP are commonly used to analyze procurement data; there is limited literature to suggest widespread application of advanced data mining techniques to procurement data.

III. DATA SET AND INPUT VARIABLES
The data set consists of procurement transactions from 2011 to 2014 (inclusive), related to the three main types of procurement approaches, namely SVP, OQ and ETC, highlighted in Table 1. This period was selected based on completeness of records (for more accurate insights drawn from the results of the analysis) and recency (for more meaningful follow-up action on the insights derived).
The records for SVP are provided by the Finance Section of Organization X. The Finance Section maintains this information in Excel Spreadsheets. There are 20,861 records for SVP. The records for OQ and ETC are extracted from the procurement databases. There are 267 and 118 records for OQ and ETC respectively.
As there are many data fields available in these records, only those containing information relevant to this study were identified. The data fields of interest include the description of the purchased items, amount spent, period of transaction, and supplier and buyer details.
The records underwent extensive data exploration, manipulation and cleaning to prepare them for analysis:
a) SVP
i. Filter transactions amounting to between $0 and $3,000.
ii. Remove transactions relating to: claims (transport, dental, medical, travel); training-related payments; public utilities (power, water, etc.) and telecommunication-related payments; contract-related payments.
iii. Remove transactions with missing fields.
iv. The number of SVP transactions was reduced to 785 records.

b) OQ and ETC
i. Categorize the transactions by year.
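The SVP clean-up steps above can be sketched roughly in pandas as follows; the column names, exclusion keywords and sample records are illustrative assumptions, not Organization X's actual schema.

```python
import pandas as pd

# Mock SVP records standing in for the Finance Section's spreadsheets
svp = pd.DataFrame({
    "description": ["Purchase of printer toner", "Dental claim - staff",
                    "Supply of fruits", "Contract-related payment", None],
    "amount":      [250.0, 80.0, 3500.0, 1200.0, 90.0],
    "supplier":    ["S1", "S2", "S3", "S4", "S5"],
})

# i.  Keep transactions between $0 and $3,000 (the SVP range)
svp = svp[(svp["amount"] > 0) & (svp["amount"] <= 3000)]

# ii. Remove excluded payment types by keyword match on the description
excluded = "claim|dental|medical|travel|training|power|water|telecom|contract"
svp = svp[~svp["description"].str.contains(excluded, case=False, na=False)]

# iii. Drop records with missing fields
svp = svp.dropna(subset=["description", "amount", "supplier"])
print(len(svp))  # only the printer-toner record survives
```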

IV. METHODOLOGY
Given the nature of the procurement process in Organization X, it would be interesting to see how the three analytic approaches prescribed by Kemp [2] for combating fraud could be adapted and applied to Organization X's procurement process to meet the objective of this study:
a) Rules-based detection - Logically, if frequent purchases can be consolidated, they can also be split. To avoid the rules put in place for a higher-EPA procurement approach, which one might perceive as more stringent in terms of approval and more time-consuming in terms of administrative effort, a high-value purchase might be split into multiple low-value purchases carried out separately. For instance, an EPA of $10,000, for which the rules prescribe procurement via an OQ, might be split into multiple lower-value purchases via the SVP approach. Hence, the scope of analysis should cover the transactions related to the different procurement approaches determined by the EPA of the intended procurement, namely:
i. SVP - Goods and services of a similar nature can be aggregated into OQs.
ii. OQ - Goods and services of a similar nature can be aggregated into OTs.
iii. ETC - While goods and services available under the ETCs are in general already aggregated, a better understanding of how they were purchased will improve procurement planning and possibly reduce the administrative effort involved in issuing CFQs.
b) Anomaly detection - In the context of consolidating frequent purchases, observations beyond the norm, i.e. abnormal trends and patterns, could be:
i. Recurring transactions of similar goods and services.
ii. Dominant suppliers, in terms of number of transactions and amount spent.
c) Advanced analytics - Text mining and clustering techniques could be appropriate given the nature of the data.
In the organization, account codes (i.e. expense codes) are currently used to categorize purchases into categories that describe the nature of the goods or services purchased. However, these can be too generic to yield further useful information about the goods or services purchased. For instance, an IT-related equipment purchase is only categorized as hardware, software, or communication equipment and network. In addition, the account codes are allocated by the purchasers on a best-effort basis and are subject to judgment errors. Hence, the description fields, which contain information on the actual goods or services purchased, reveal more detail about the transactions than the account codes.
A key problem with the description fields is that they are captured in free-text format. They are therefore likely to include a significant amount of irrelevant and noisy information such as dates, names, teams, etc. To the untrained eye, the goods or services purchased would not be readily apparent.
Taking the above into account, a text mining with cluster modelling approach (a conventional data mining technique) using RapidMiner (a software platform that provides an integrated environment for machine learning, data mining, text mining, predictive analytics and business analytics) is used in this study. The text processing algorithms in RapidMiner's text mining extension transform the text data, i.e. the description fields, into a format that can be readily analyzed with a cluster modelling technique. Each record is characterized in terms of the occurrence frequencies of the words in it, and the cluster modelling places together records that have a similar distribution of word frequencies. A higher occurrence of a word implies a higher number of transactions, i.e. an item purchased more times. However, it is important to bear in mind that a higher number of transactions does not necessarily mean that a correspondingly larger amount is spent. In other words, a drawback of text mining with cluster modelling is that it cannot tell whether a specific good or service with many transactions is of relatively small value; similarly, it would miss a specific good or service with few transactions but of relatively higher value.
ii. Tokenize Linguistic (Tokenize).

c) Select Attributes - Only columns with numeric values are selected for clustering (due to the choice of clustering technique, K-Means, in the following step).
d) Clustering - The K-Means clustering algorithm is used to group the records in terms of the occurrence frequencies of the words in them. K-Means is selected for its simplicity and speed among the clustering techniques.
e) Write Excel - The output, i.e. the cluster groups, is saved into an Excel file so that it can be combined with the other data fields (amount spent, period of transaction, supplier and buyer details) to derive further insights.
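The pipeline described above (tokenize, filter stopwords, vectorize by word occurrence, cluster with K-Means) can be approximated outside RapidMiner, for example with scikit-learn. The sketch below is a minimal stand-in for the actual process used in the study; the documents and stopword list are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

# Hypothetical description-field texts standing in for the procurement records
docs = [
    "purchase of printer toner", "printer maintenance",
    "supply of fruits and biscuits", "fruits for pantry",
    "phonex screen replacement", "phonex cable",
]

# Tokenize and build word-occurrence vectors, filtering generic stopwords
# (mirrors "Tokenize" and "Filter Stopwords (Dictionary)" in RapidMiner)
vectorizer = CountVectorizer(stop_words=["purchase", "supply", "of", "and", "for"])
X = vectorizer.fit_transform(docs)

# Group records with a similar distribution of word frequencies
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)

# Records describing the same goods should tend to share a cluster label
print(list(zip(docs, labels)))
```

The cluster labels would then be joined back to the amount, period, supplier and buyer fields, mirroring the Write Excel step.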
For each group of data (SVP, OQ and ETC), the methodology for the analysis is as follows:
a) Approach 1
i. Input the data into the designed process in RapidMiner.
ii. Run the process.
iii. Examine the results - Word List, Cluster Model and Centroid Table.
iv. Re-calibrate the parameters accordingly and repeat from Step ii. Compare the results of multiple runs with different k values and choose the best one.
v. Combine the results of the selected clustering with the other data fields, i.e. amount spent, period of transaction, supplier and buyer details.
vi. Explore and analyze the results for insights.

b) Approach 2
i. Tabulate the data by supplier against the period of transaction, amount spent, number of transactions made and the buyer details.
ii. Identify suppliers from whom purchases were made in every year.
iii. Extract the records for the identified suppliers.
iv. For those with many transactions, follow the steps in Approach 1 using the data obtained in Step iii.
v. Explore and analyze the results for insights.
Approach 1 begins with the identification of the common goods and services purchased; the output is analyzed together with related information on the amount spent, period of transaction, and supplier and buyer details. Approach 2 begins with the identification of suppliers based on the value and volume of transactions. Both approaches incorporate further data points to address the drawback of the model, which by itself cannot tell whether a specific good or service with many transactions is actually of significance. The two approaches complement each other, and their results can be compared against each other for verification and completeness when drawing conclusions.
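Step ii of Approach 2, identifying suppliers from whom purchases were made in every year, can be sketched in pandas as follows; the supplier, year and amount fields are hypothetical stand-ins for the tabulated data.

```python
import pandas as pd

# Mock transactions over the 2011-2014 window
tx = pd.DataFrame({
    "supplier": ["S1", "S1", "S1", "S1", "S2", "S2"],
    "year":     [2011, 2012, 2013, 2014, 2011, 2013],
    "amount":   [2500.0, 2600.0, 2700.0, 2800.0, 900.0, 400.0],
})

years = set(range(2011, 2015))
per_supplier_years = tx.groupby("supplier")["year"].apply(set)

# Keep suppliers whose transactions cover every year of the window
covers_all = per_supplier_years.apply(lambda ys: years <= ys)
every_year = per_supplier_years[covers_all].index

# Extract the records for the identified suppliers (Step iii)
recurring = tx[tx["supplier"].isin(every_year)]
print(list(every_year), len(recurring))
```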

V. RESULTS AND DISCUSSION

A. SVP
Approach 1
An examination of the word list generated from all the records, sorted by total number of occurrences and by the number of documents containing each word, revealed high occurrences of irrelevant words such as purchase, service, supply, etc. Including these words in the clustering process would affect the results, so they were identified and added to the "Stopwords" list under "Filter Stopwords (Dictionary)". The cleaned-up word list then presented words such as PhoneX, printer, TabletX, cable, rubber stamp, book, fruit, camera, biscuit, BTH, bowl, certificate, screen, etc. with the highest occurrences. This provided a clearer indication of the purchases made and of the keywords to watch for in subsequent steps of the analysis.
In determining the k value (i.e. the number of clusters for k-means clustering), the "rule of thumb" [10] (k = √(n/2), where n is the number of data points) was used as a quick and simple method. For 785 records, k = 20 was derived. For verification and comparison purposes, additional runs were also made for k = 25 and k = 30.
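The rule of thumb is trivial to compute; the following sketch reproduces the k values used for the three data sets in this study (785 SVP, 267 OQ and 118 ETC records).

```python
from math import sqrt

def rule_of_thumb_k(n: int) -> int:
    """Rule-of-thumb cluster count: k = sqrt(n/2), rounded to the nearest integer."""
    return round(sqrt(n / 2))

# The values derived in this study for SVP, OQ and ETC respectively
print(rule_of_thumb_k(785), rule_of_thumb_k(267), rule_of_thumb_k(118))  # 20 12 8
```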
The Centroid Cluster Model in Table 2 shows the results, interpreted based on the term frequency of the keywords generated from each cluster (for k = 20, 25 and 30). The results at k = 25 and k = 30 indicate that the keywords in the majority of the clusters were recurring, e.g. PhoneX, screen, fruit, juice, biscuit, printer, book, etc., indicating a strong presence, albeit with each broken down into smaller clusters, e.g. fruits appeared in Clusters 3, 19 and 21 at k = 30. The new keywords that appeared at k = 30 consisted of fewer items and were not significant. The clustering at k = 25 was selected for further analysis since it covered most of the keywords generated across the different k values. Using the Excel file generated, the clustering results were combined with the other data fields, i.e. amount spent, supplier and buyer details, for further analysis. A closer examination revealed that most of the clusters were not perfect, i.e. not all similar goods and services were grouped together by the clustering process. For example, in Cluster 0 shown in Table 3, the clustering is probably based on occurrences of the word "Certificate", but the items are distinct: ET Certificates, HT certification and Electrical Certification. The clusters therefore had to undergo further verification by a keyword search of the description field based on the keywords identified. This step was done manually to determine the final clusters.
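The manual keyword-search verification can be illustrated as below; the cluster contents are a mock-up of the Cluster 0 example, not the actual data of Table 3.

```python
import pandas as pd

# Mock-up of a cluster grouped on the word "certificate" but mixing distinct goods
cluster0 = pd.DataFrame({
    "description": ["ET Certificates renewal", "HT certification course",
                    "Electrical Certification audit"],
    "amount": [500.0, 800.0, 650.0],
})

# A keyword search on the description field splits the cluster into final groups
keywords = ["ET Certificate", "HT certification", "Electrical Certification"]
counts = {kw: int(cluster0["description"].str.contains(kw, case=False).sum())
          for kw in keywords}
print(counts)
```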
The identified clusters, types of good or service, number of suppliers, frequency of purchases (by year), teams that made the purchases and amount spent are summarized in Table 4. Goods and services where the total amount spent is more than S$3,000 present possible opportunities for the aggregation of purchases via OQs to achieve economies of scale, or to improve administrative efficiency by establishing an ETC so that the product or service can be purchased directly when required during the contractual period. Goods or services provided by a single supplier and purchased every year, such as the X developer program, ET Certificates, Security Holographic Stickers and HT services and accessories, are such examples. It is interesting to note that for the majority of the transactions, purchases of a specific good or service were made by a single team, contrary to the earlier assumption that aggregation of purchases could be made across teams.

Approach 2
The nature of SVP means that numerous purchases could be made from one supplier, either within a single year or across different years. The next step is to investigate suppliers from whom purchases were made every year. These suppliers, the frequency of transactions with them, the amount spent and the types of goods and services are summarized in Table 5. Most of the goods and services identified in Approach 2 were also present in Approach 1, except for two main ones: "Standby Technician Support for Video Conferencing System" and "Rental of Cherry Picker". This is probably because each appeared as a single transaction in most years; hence the term occurrence in the text mining analysis was low, i.e. it did not feature significantly in the generated Word List or Centroid Cluster Model. However, these observations are noteworthy, as they were repeated purchases of relatively significant value in the context of SVP (averaging between S$2,500 and S$2,800) in most years.

B. OQ
Approach 1
Similar to the approach used for SVP, irrelevant words were removed from the word list to improve the clustering results. The cleaned-up word list, sorted by total and document occurrences, presented words such as licenses, servers, network, CCTV, video, audio, anti-virus, TabletX, etc. with the highest occurrences.
In determining k, i.e. the number of clusters for k-means clustering, the "rule of thumb" was again used. For 267 records, k = 12 was derived. As before, additional runs were made for k = 20 and k = 30 for verification and comparison purposes. The Centroid Cluster Model in Table 6 shows the results, interpreted based on the term frequency of the keywords generated from each cluster (for k = 12, 20 and 30). The results at k = 20 and k = 30 indicate that the keywords in the majority of the clusters were recurring, e.g. licenses, servers, TabletX, etc., indicating a strong presence, albeit with each broken down into smaller clusters, e.g. licenses appeared in Clusters 0, 15, 24 and 29 at k = 30. The new keywords that appeared in the clusters consisted of fewer items.
The clustering at k = 30 was selected as the basis for further analysis since it covered most of the words generated; the further breakdown of the licenses into smaller clusters was also useful. After combining the clustering results with the other data fields, i.e. amount spent, supplier and buyer details, a closer examination revealed that most of the clusters were not perfect, i.e. not all similar goods and services were grouped together by the clustering process - similar to what was observed for SVP.
The identified clusters, types of good or service, number of suppliers, frequency of purchases (by year), teams that made the purchases and amount spent are summarized in Table 7. Notwithstanding that Organization X might already have made efforts to consolidate these purchases (some of them did not occur in 2014), goods and services where the total amount spent is more than S$70,000 present possible opportunities for the aggregation of purchases via OT to achieve economies of scale. It is interesting to note that for the majority of the transactions, purchases of a specific good or service were made by a single team, contrary to the earlier assumption that aggregation of purchases could be made across teams. Opportunities for aggregation across teams include anti-virus licenses, servers, rental of network equipment, maintenance of the Video Conferencing System and maintenance of UPS. It was also observed that some items bought under OQ were also bought under SVP, e.g. printer, hard disk and TabletX.

Approach 2
Unlike SVP, goods and services purchased via an OQ could last for a contractual period beyond one year due to their higher value, for instance where an ETC is put in place so that the department can procure directly from the appointed supplier(s) whenever the product or service is required during the contractual period. Therefore, identifying only suppliers from whom purchases were made every year would not be sufficient. Instead, the cumulative value of the contracts awarded over the four-year period was used as the measure to identify suppliers, with a cumulative value of S$70,000 as the benchmark. These suppliers, sorted by total cumulative amount spent and total number of transactions, are shown in Table 8. The remarks column indicates the predominant type of good or service purchased.
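The cumulative-value benchmark amounts to a simple group-and-filter, sketched below; the supplier names echo those mentioned in the text, but the amounts are invented for illustration.

```python
import pandas as pd

# Hypothetical OQ contract awards over the 2011-2014 window
oq = pd.DataFrame({
    "supplier": ["Supplier 13", "Supplier 13", "Supplier 14", "Supplier 20"],
    "amount":   [60000.0, 45000.0, 80000.0, 15000.0],
})

# Cumulative value and transaction count per supplier, highest spend first
totals = (oq.groupby("supplier")["amount"]
            .agg(total="sum", transactions="count")
            .sort_values("total", ascending=False))

# Apply the S$70,000 cumulative-value benchmark
shortlist = totals[totals["total"] >= 70000]
print(shortlist)
```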
It was observed that a specific type of good or service is usually provided by a dominant supplier, more so for services than for goods, e.g. the supply and maintenance of IT systems. It is interesting to note that the different types of software licenses required by different teams are provided predominantly by a single supplier, Supplier 13. Another noteworthy observation is the maintenance of several IT systems by a single supplier, Supplier 14. This provides evidence that there are opportunities for different goods and services to be aggregated to achieve better pricing, contrary to the earlier assumption that purchases could be aggregated for similar goods and services only.
Approach 1 was more effective at identifying purchases of common goods and services across different suppliers; e.g. for the provision of anti-virus licenses, Approach 1 identified five different suppliers while Approach 2 identified only one (without going through the entire list of suppliers). Approach 2 was effective at identifying the dominant suppliers and provided insights, such as the consolidation of different goods and services, that would have been missed using Approach 1.

C. ETC
Approach 1
Similar to the previous approaches, irrelevant words were removed from the word list to improve the clustering results. The cleaned-up word list, sorted by total and document occurrences, presented words such as licenses, engineers, engineering and project management, UPS, License I, License O, switch, etc. with the highest occurrences. This gave a clearer indication of the purchases made and of the keywords (items) to watch for in subsequent steps of the analysis.
In determining k, i.e. the number of clusters for k-means clustering, the "rule of thumb" was used. For 118 records, k = 8 was derived. For verification and comparison purposes, additional runs were also made for k = 12 and k = 16.
The Centroid Cluster Model in Table 9 shows the results, interpreted based on the term frequency of the keywords generated from each cluster (for k = 8, 12 and 16). The results at k = 12 and k = 16 indicate that the keywords in the majority of the clusters were recurring, e.g. licenses, servers, racks, Engineering and Project Management, etc., albeit with each broken down into smaller clusters, e.g. licenses appeared in Clusters 7, 8 and 15. At k = 16, the new keywords that appeared in the clusters consisted of fewer items, with quite a few clusters containing zero or one item.
The clustering at k = 12 was selected as the basis for further analysis since it covered most of the words generated across the different k values. After combining the clustering results with the other data fields, i.e. amount spent, supplier and buyer details, a closer examination revealed that most of the clusters were not perfect, i.e. not all similar goods and services were grouped together by the clustering process - similar to what was observed for SVP and OQ. The clusters had to undergo the same verification and manipulation prescribed previously. Goods and services available under the ETCs are in general already aggregated; a better understanding of how they were purchased will improve procurement planning and possibly reduce the administrative effort involved in issuing CFQs, e.g. by combining CFQs from multiple teams or issuing multi-year CFQs. It is also possible to achieve further economies of scale, as suppliers are known to offer better pricing than that stated in the ETC.
The identified clusters, types of good or service, number of suppliers, frequency of purchases (by year), teams that made the purchases and amount spent are summarized in Table 10. These goods and services present possible opportunities for the aggregation of purchases. It is interesting to note that items bought under ETC were also bought under OQ, e.g. O and SY licenses; however, investigating the possible reasons for this is beyond the scope of this project. There is also some evidence to suggest efforts in the aggregation of goods and services: in the OQs, there were rentals and purchases of network equipment from 2011-13, but these stopped in 2014, while similar purchases were made under ETC in 2014. Another observation is that while there were outright purchases of certain goods under ETCs, the maintenance of similar goods was procured under OQ, e.g. servers and UPS. However, there is no evidence to prove that these were the same pieces of equipment bought under the ETCs and later maintained under the contracts established via OQs. For the majority of the transactions, purchases of a specific good or service were made by a single team, contrary to the earlier assumption that aggregation of purchases could be made across teams. Opportunities for aggregation across teams include purchases of licenses (O, I and SY), servers, and Engineering and Project Management services.

Approach 2
For ETC, identifying a supplier based on the value and volume of its transactions is less relevant to the analysis, because the goods and services a supplier offers are already known upfront at the point the ETC is established, even before a CFQ is issued. Hence, Approach 2 is not applicable to this part of the analysis.

Managerial perspective
The findings suggest that opportunities exist for Organization X to aggregate common goods and services among the purchases made under SVP, OQ and ETC (Tables 4, 7 and 10). The analysis further suggests that these opportunities are more prevalent in purchases made by individual project teams rather than across multiple project teams. However, it must be acknowledged that in reality, circumstances such as different project timelines and deadlines, unanticipated changes and dynamic stakeholder requirements make such procurement planning very challenging in the short term. These are the most likely reasons for the separate transactions for similar goods and services detected. There are some indications that Organization X has undertaken efforts to consolidate frequent purchases: for recurring purchases of O licenses and rental of network equipment, no such transactions appeared in the 2014 OQ list, while they did appear in the 2014 list of ETC transactions.
The results obtained from the analysis should increase Organization X's awareness and improve its visibility of the goods and services it has been procuring in recent years. It is recommended that the lists of identified goods and services be shared with the different project teams to facilitate long-term procurement planning within teams and better synergy in coordinating procurement efforts across teams.

Research perspective
The use of advanced data mining techniques such as text mining and cluster analysis complements the OLAP tools commonly used in analyzing procurement data, addressing the inadequacy of OLAP tools in generating new information from textual data. However, it has its shortfalls: natural language text contains ambiguities, and it remains difficult to analyze the semantics and interpret the meaning of the keywords, e.g. a specific cluster might group together both the purchase of an Apple computer and apples (the fruit). Keywords in the description fields that are captured with spelling errors are useless; as the saying goes, "garbage in, garbage out". The output is not an end in itself. The process is most rewarding when the data that text mining generates can be further analyzed by a subject matter expert, who can bring additional knowledge for a more complete picture. Text mining can create new relationships and hypotheses for further exploration.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 6, No. 8, 2015
While the cluster analysis was useful in grouping most of the transactions and offered a general overview of the types of goods and services purchased, further manual manipulation of the clusters was required to derive more accurate output and meaningful insights.

Further improvements
To improve the output of the text mining analysis, more effort could be made to experiment with the various text processing algorithms in RapidMiner's text mining extension. The better the quality of the output, i.e. the keywords identified, the more accurate the clustering of the records.
Data fields such as the account codes could be used to categorize the records at a broad level before applying the text mining and clustering analysis, improving the semantics of the keywords extracted from the description fields. More clustering algorithms could also be applied to compare the accuracy of the clustering output; in this study, only K-means clustering was used.
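One way to compare clustering algorithms, as suggested above, is to score each on the same vectorized data, e.g. with the silhouette coefficient. The sketch below contrasts K-Means with agglomerative clustering on synthetic data standing in for the word-frequency vectors; it illustrates the comparison procedure, not a result from this study.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Synthetic, well-separated groups standing in for vectorized descriptions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 5)) for c in (0, 3, 6)])

# Score each algorithm's labels on the same data (higher silhouette is better)
scores = {}
for model in (KMeans(n_clusters=3, n_init=10, random_state=0),
              AgglomerativeClustering(n_clusters=3)):
    labels = model.fit_predict(X)
    scores[type(model).__name__] = silhouette_score(X, labels)
print(scores)
```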