Fraud Detection in Shipping Industry using K-NN Algorithm

The shipping industry is experiencing tremendous growth in volume thanks to technological innovation in e-commerce and global trade liberalization. Volume growth also brings a rise in fraud cases involving smuggling and false declaration of shipments. Shipping companies and customs authorities rely mostly on routine random inspection, so fraud is often found only by chance. As volume increases dramatically, traditional fraud detection strategies will no longer be sustainable or effective for either shipping companies or customs. Related papers in this area have reported that intelligent data-driven fraud detection is far more effective than routine inspections. However, the effectiveness of data-driven detection often depends on the availability of data and on the various mechanisms fraudsters use to commit shipment-related fraud. In this paper, we therefore review and identify the most suitable approaches and algorithms for detecting fraud effectively within the shipping industry. We also identify factors that influence fraud activity, review existing fraud detection models, develop the detection framework and implement the framework using the


I. INTRODUCTION
The World Customs Organization's (WCO) Illicit Trade Report 2016 states that 2016 was marked as the year of Digital Customs, in which administrations were encouraged to actively showcase and promote the use of Information and Communication Technologies (ICT), using a data-driven approach to collect and safeguard duties, control the flow of goods and people, and secure cross-border trade [1]. CIO Magazine from IDG identified fraud detection as one of the IT projects primed for machine learning [2]. As such, there is a pressing need within the industry for a more intelligent fraud detection system that can considerably improve the detection of wrong declarations and smuggling compared to random checks.
Trade liberalization and technological innovation, such as advances in e-commerce, have accelerated international shipping volumes significantly over the last few years. Full-year 2017 data for global air freight markets released by the International Air Transport Association (IATA) show that demand, measured in freight tonne-kilometers (FTKs), grew by 9.0%. This was more than double the 3.6% annual growth recorded in 2016 [3]. Such a rapid rise in volume will lead to an increase in safety and compliance issues, because it strains the ability of both shipping companies and customs authorities to perform safety and compliance audits on most shipments. Meanwhile, at the other end of the spectrum, customers are demanding faster deliveries from e-commerce providers and shipping companies. By fulfilling urgent deliveries, shipping companies can also expand into new market segments, such as urgent medical deliveries, which provide a higher profit margin. Placing more manual checks on shipments, by either shipping companies or customs, will therefore only cause more delays, which directly affects the profitability of shipping companies and their customer base. As such, there is an urgent need to automate and increase the effectiveness of the random checking currently done by both shipping companies and customs. Shippers commit various violations, such as illicit trade, smuggling that violates shipping restrictions between countries, and miscoding of shipped items to reduce customs duty payments. There can also be security and safety issues if shipments are not audited thoroughly. Consider the impact of dangerous or flammable goods, such as mobile batteries, being declared as safe goods in an air freight shipment: the consequences could be as devastating as a plane crash.
According to the UK P&I Club, 27% of incidents on cargo ships in 2013 and 2014 were attributable to mis-declared hazardous cargo, second only to poor packaging [3].
The issues and challenges highlighted in [4] and [5] are generally concept drift (dynamic fraud patterns), overlapping data, the capability to support real-time detection requirements, skewed distribution, the integration of vast amounts of data, and data quality-related issues. Concept drift is the challenge of dealing with sudden changes in customer behavior, which can be flagged as fraud and turn out to be false positives. The way to overcome concept drift is to use an adaptive fraud detection system (FDS) algorithm that learns and improves over time by factoring in all the input variables that may influence the change in expected behavior. Overlapping data is the issue where fraudulent transactions are made to look like genuine data, which produces false negative cases. Skewed distribution is the issue of having a very low ratio of fraudulent cases, which may not be sufficient to train supervised classification-based FDS algorithms. Data quality issues also need to be reviewed, as this factor directly affects the efficiency of fraud detection.
Each issue and challenge affects the respective fraud domain differently. There are mainly 5 business domains where FDS has been applied, namely banking, telecommunication, insurance, online business, and shipping [6]. The specific area involved in banking is credit card-related fraud. In insurance, the specific areas are medical insurance claims and vehicle insurance claim-related fraud [7]. In online business, the typical fraud area is online auction-related fraud. Shipping-related fraud is usually related to smuggling and miscoding, which is a false declaration of the goods being shipped. The critical issue for the shipment domain is getting efficient real-time results from a huge data set. The bigger the data set, the better the detection results, so we need a solution architecture that can process an optimal volume of the desired dataset while remaining efficient enough to be executed in real time. In the express shipping domain, accuracy and real-time performance are critical because the life cycle of a shipment only varies between 3 and 5 days depending on the weight and location. Identifying the fraud before the shipment is delivered is therefore essential. The earlier shipments are intercepted, the bigger the cost benefits for both organizations and customers. Immediate detection avoids revenue leakage and improves customers' trust and confidence in the organization's brand. It also ensures that fraud culprits are identified effectively and handed over to the authorities much earlier, which may help to reduce future fraud cases.
To build a solution that detects fraud effectively, we need to identify the parameters or data elements that have the most influence in actual fraud cases. A study done in a leading global logistics company identified location as one of the key parameters that influence fraud cases. The location for a shipment can be either its origin or its destination. According to a report published by the World Customs Organization (WCO), shipment origin and destination can be a major factor, as most frauds tend to originate from or be sent to specific locations. Based on data provided by a major global logistics organization, the data extracted for our simulation are the latitude and longitude values of the origin and the destination. Since these values are numerical, they identify a location precisely. With these precise numerical location values, a specific fraudulent shipment origin or destination and its surrounding area within a defined radius will be tagged as fraudulent by the algorithm. Numerical data also allows the algorithm to process much faster than text or image data would.
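The idea of tagging a known fraudulent location together with its surrounding area can be sketched as follows. This is only an illustration under stated assumptions, not the study's implementation: the haversine great-circle distance, the 50 km default radius, the function names, and the sample coordinates in the usage note are all hypothetical choices made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def near_known_fraud(location, fraud_locations, radius_km=50.0):
    """Return True if location lies within radius_km of any known fraud location.

    radius_km is a hypothetical default; the source does not state the radius used.
    """
    lat, lon = location
    return any(
        haversine_km(lat, lon, f_lat, f_lon) <= radius_km
        for f_lat, f_lon in fraud_locations
    )
```

For instance, `near_known_fraud((3.14, 101.69), [(3.15, 101.70)])` tags a point a couple of kilometres from a known fraud location, while a point several hundred kilometres away is left untagged.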

II. LITERATURE REVIEW
In subsequent sections, we will be reviewing various papers that are related to fraud detection.

A. Methods
The search strategy, that is, the definition and selection process used to find the most relevant papers, is described in the following. The digital databases searched in this review are IEEE Xplore, Springer, and Science Direct. These three databases were selected because they offer highly cited and reliable papers in the fields of computer science and its related applications. The review objective is to find all primary research work associated with fraud detection systems within the shipping or logistics domain. The phrase searched initially was "fraud detection system and shipping or cargo or freight or logistics", but since there were few fraud detection system papers in the shipping domain, most of the results were relevant only to fraud detection in general. There were only 3 papers related to the shipping domain [8]. Finally, only the term "fraud detection system" was used. The initial query resulted in a total of 5866 papers: 598 from IEEE Xplore, 964 from Science Direct, and 4304 from Springer. The filtered articles were published between 2000 and 2018. For Science Direct and Springer, a topic filter was applied in addition to the year filter to ensure that non-computer science-related papers were excluded. The year range was narrowed to 2000 to 2018 because the number of results ran into the thousands. After sifting through the papers, we divided them into survey papers and specialized papers, which delve into specific techniques or business domain areas such as finance (credit card or insurance), healthcare, telecommunication, and internet-related marketing fraud.

B. Review
The earliest survey paper since 2000 is [9], which reviews fraud detection from a statistical perspective. Like most fraud detection papers, it categorizes basic statistical models for fraud detection into supervised and unsupervised methods. Besides categorization by model, it also surveys papers by application area or domain. The applications covered include credit card fraud, money laundering, telecommunications fraud, computer intrusion (now commonly known as hacking), and medical and scientific fraud, which also includes plagiarism in the education sector. The paper concludes that the key issue in fraud detection is its effectiveness, and that factors such as the speed of detection are directly related to effectiveness. It therefore suggests a graded system of investigation in which areas with very high suspicion and high fraud value merit immediate and intensive investigation. The paper also concludes that fraud detection can be achieved even in difficult circumstances, but that many challenges and opportunities remain to be tapped in the future. In 2004 another fraud detection survey paper [10] was published, focusing more on fraud detection techniques. The domain areas covered are credit card fraud detection, telecommunication fraud detection, and computer intrusion detection. A common technique applied to credit cards is outlier detection, an unsupervised method that does not rely on historical data. Outliers are identified by observing deviation from the normal or average pattern, which makes the method suitable for detecting fraud that has not previously occurred. To detect fraud patterns that have occurred previously, supervised methods using historical labeled data are used.
Neural networks, sets of interconnected weighted nodes designed to function like a human brain, are also widely applied to credit card fraud detection, but this technique requires an actual data set, which is rarely made available to the public. For computer intrusion detection, several techniques are applied, such as expert systems, neural networks, model-based reasoning, data mining, and state transition analysis. The challenges in the computer intrusion domain are handling heaps of audit trail data, dealing with the false alarm rate, difficulty in testing and simulating potential scenarios, and poor portability, as the ruleset is very specific to a particular environment. Lastly, in telecommunication fraud detection the techniques used include rule-based methods, neural networks, including Bayesian networks, and visualization methods. The challenge of managing the data load in supervised learning for rule-based and neural network techniques can be mitigated by using unsupervised learning to filter out normal behavior data. To create a more robust selection process for the rule-based technique, a non-greedy rule-selection approach can be explored further. The telecommunication environment is very dynamic and always evolving, so it requires accurate definitions of thresholds and parameters that are in tune with the changing landscape of this domain.
In 2010, a survey of data mining-based fraud detection research covered papers from 1998 to 2010 [11]. This paper also highlights the types of fraudsters and the affected industries. Fraudsters are divided broadly into managers, employees, and external parties. The most challenging fraudsters are the external parties, as there are many of them and they can make use of various complex and new fraud mechanisms. This is the area where data analytics or data mining techniques need to be applied, as they are more cost-effective than conventional manual methods for finding the riskiest parties, using suspicion scores, rules, and visual anomalies that can be investigated and refined. The paper also identifies the fraud domain areas as internal fraud, which is committed by management and staff within the organization, insurance, credit card, and telecommunications. Credit transactional fraud detection has received the most attention from researchers. There are also other emerging fraud areas, such as e-business and e-commerce related fraud in the online world. The two main challenges of data mining-based fraud detection research are the lack of publicly available real data on which to perform research and the lack of well-researched methods and techniques. To overcome the challenge of data availability, some papers, such as [12] [13] [14] [15], proposed using simulated data that closely matches the actual data, which is often too sensitive to be shared in public domains. To overcome the lack of well-researched methods and techniques, performance metrics and measures are critical to ensure that fraud detection gets well-deserved attention from business stakeholders, who can invest and provide funding that allows research and development in these initiatives to flourish.
One such measure is placing a monetary value on predictions, so that cost savings or profits can be maximized through a cost and benefit model customized to the respective business needs. Other considerations in choosing a fraud detection method are the speed of detection and the style of detection, such as online (real-time) or batch mode.
In early 2016, Abdallah et al. [4] released a survey paper that covers papers from 1997 to 2014. This paper provides a good summary of the matters surrounding fraud detection systems. Fraud is defined by the Association of Certified Fraud Examiners (ACFE) as the use of one's occupation for personal enrichment through the deliberate misuse or misapplication of the employing organization's resources or assets. There are 2 main types of fraud systems, namely fraud prevention systems and fraud detection systems. Fraud prevention is the first line of defense against fraud, blocking the entry of fraudsters into the system. Fraud detection is the next layer of defense, which detects fraudsters who have already committed the fraudulent act. Over the years many fraud detection approaches and techniques have been applied.
In [16], fraud detection techniques are grouped at the highest level by area of study. The 2 main groups are statistical modeling and machine learning. Statistical modeling is an area of mathematics that deals with collecting and analyzing data under some assumptions. Machine learning is a technique that uses algorithmic models that learn from data to solve complex problems. There are 2 main methods of machine learning, namely supervised and unsupervised. Some approaches combine the two and are known as semi-supervised. The difference between them is that supervised methods use labeled data, as opposed to unsupervised methods, which do not use any labeled data. Labeled data is data in which the fraud cases have been identified, and it is used to train the algorithm or model. Unsupervised techniques rely on grouping similar attributes or finding outliers to identify unusual behavior or patterns that can be investigated further. An overview of the various methods and techniques is illustrated in Figure 1. Table 1 provides a summary of fraud detection data mining tasks with commonly used algorithmic techniques and example use cases [17]. Table 2 provides a comparison summary of fraud detection data mining algorithms that helps to identify a suitable algorithm for this paper's use case [17].
Another potential mechanism that could be used in Fraud detection is using multi-agent systems [

III. SOLUTION DESIGN
The processes defined for the proposed model are shown in Table 3. First, the fraud data is prepared, normalized, and cleaned up by replacing missing values and removing duplicates. The data is simulated based on a study done in a global logistics organization, using its historical shipment origin and destination data for 5 years from 2012 to 2017. There are 5 columns, namely Fraud label, Origin City Latitude, Origin City Longitude, Destination City Latitude, and Destination City Longitude. An origin or destination with a high number of fraud cases is tagged with "Y" in the fraud field. The number of rows simulated is 1500 records. A snapshot of the data is shown in Table 3.
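The cleanup step described above can be sketched in plain Python. This is a minimal sketch under assumptions, since the study performed this step in its own tooling: the field names, the choice of the column mean as the replacement for a missing value, and the sample record layout are all hypothetical.

```python
from statistics import mean

# Hypothetical field names for the four coordinate columns.
COORD_FIELDS = ("orig_lat", "orig_lon", "dest_lat", "dest_lon")


def prepare(rows):
    """Fill missing coordinates with the column mean, then drop exact duplicates.

    rows: list of dicts with a "fraud" label ("Y"/"N") and the four coordinates,
    where a missing coordinate is represented as None.
    """
    # Column means computed from the values that are present (an assumed strategy).
    means = {
        f: mean(r[f] for r in rows if r[f] is not None)
        for f in COORD_FIELDS
    }
    seen, cleaned = set(), []
    for r in rows:
        filled = {"fraud": r["fraud"]}
        for f in COORD_FIELDS:
            filled[f] = means[f] if r[f] is None else r[f]
        key = (filled["fraud"],) + tuple(filled[f] for f in COORD_FIELDS)
        if key not in seen:  # keep only the first copy of each record
            seen.add(key)
            cleaned.append(filled)
    return cleaned
```

Other replacement strategies, such as dropping incomplete rows, would fit the same structure; the mean is used here only to keep the example short.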
Once data preparation is complete, the data is fed into the process model. The Fraud attribute is assigned the target (label) role. Once the label has been set up, the validation process can be configured. The split type can be set as relative, and the ratio can be varied from 0.6 to 0.8 to increase the accuracy. A split ratio of 0.75 means that 75% of the data is used to train the algorithm and the remaining 25% is used to test the trained algorithm. The model design is illustrated in Figure 2. The flow in blue uses the labeled data and is iterated with various combinations of split ratio, algorithm, and algorithm-specific parameters. Tuning these variables produces results that can be measured in terms of accuracy within an acceptable execution time.
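A relative split of this kind can be illustrated with a few lines of Python. The sketch below assumes shuffled sampling with a fixed seed; the function name, the seed, and the sampling strategy are assumptions for the example and are not taken from the study's configuration.

```python
import random


def split_relative(records, ratio=0.75, seed=42):
    """Shuffle records and split them into (train, test) by a relative ratio.

    ratio=0.75 puts 75% of the records in the training set and 25% in the test set.
    """
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must be strictly between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(round(len(shuffled) * ratio))
    return shuffled[:cut], shuffled[cut:]
```

With the 1500 simulated records and a ratio of 0.75, this yields 1125 training rows and 375 test rows.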
Once acceptable performance is achieved, the chosen algorithm, together with its known parameters, can be applied to all new incoming shipment data. In this way, fraudulent shipments can be detected while the shipment is still in progress within the network, before it is delivered. The algorithm updates the prediction column to flag fraudulent shipments accordingly. Shipments flagged as fraud can be investigated further to check whether the fraud is genuine. If a shipment was wrongly flagged, further analysis needs to be done to improve the algorithm in the future. Analysis also needs to be done on shipments that were not flagged by the algorithm but were later found to be fraudulent. Various tools on the market can be used to perform fraud detection, such as R, RapidMiner, SAS Enterprise Miner, and IBM SPSS. In this paper, we have selected RapidMiner as it is simple to use and has many built-in, ready-to-use algorithms, which allows various techniques to be tested against the available data. Furthermore, a free version is available for students. Once the data is ready, the RapidMiner tool can be used to set up the model. Data used for this modeling will be as per below.
The algorithms tested are Naive Bayes, Neural Net, Deep Learning, Decision Tree, Logistic Regression, SVM, and finally k-Nearest Neighbors (k-NN), as shown in Figure 3.
After several executions with various combinations of algorithm and split ratio, it was found that the best result, with the highest accuracy of about 98.4%, was achieved using the k-NN algorithm with default parameters, as shown in Figure 4.
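For readers unfamiliar with k-NN, the classification rule can be sketched in a few lines. This is a generic illustration, not the tool's implementation: it assumes Euclidean distance on the raw latitude and longitude values and a simple majority vote, and the function name and sample rows in the usage note are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k=1):
    """Predict the label of query by majority vote among its k nearest training rows.

    train: list of (features, label) pairs, where features is a tuple of floats
           (here: origin lat, origin lon, destination lat, destination lon).
    query: a feature tuple of the same length.
    """
    # Order the training rows by Euclidean distance to the query and keep the first k.
    neighbours = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

A new shipment whose origin and destination lie close to rows labeled "Y" in the training data is therefore predicted as "Y", which matches the location-based reasoning used throughout this paper.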
As shown in Figure 5, in terms of execution speed most of the algorithms returned the result immediately, except for Neural Net, which took almost 2 seconds, and Deep Learning, which took 6 seconds. As such, these 2 algorithms are not suitable for fraud detection within the shipping domain, where speed is one of the key criteria.
Figure 6 shows the relationship between the key k-NN parameter k, the number of nearest neighbors considered, and the accuracy of the prediction. The highest accuracy was recorded when k is either 1 or 2, and accuracy starts to drop once k is increased beyond 2. In this study, the k-NN algorithm was identified as giving the best results against the accuracy and speed criteria required for fraud detection within the shipping domain. The results in Figure 6 show that genuine fraud was detected correctly in 88% of the total cases, and non-fraud cases were predicted correctly in 99.14% of the total cases. To achieve high accuracy with good performance, the k-Nearest Neighbors algorithm is therefore proposed to detect fraud in the shortest possible time within the shipping domain. This proposal is based on the modeling simulation done in RapidMiner, which indicates that the technique usually provides an acceptable response time during execution, within a second. The model for this solution is shown in Figure 7, where the shipping data is first pre-processed to ensure there are no missing values. The pre-processed data is then split into training and test data, with 75% used for training the algorithm and 25% for testing it, which is also close to the split recommended by [17].
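The accuracy and per-class figures quoted above can be computed from the actual and predicted labels of the test set. The sketch below assumes that the two percentages are per-class recall, that is, the share of actual fraud (or non-fraud) rows that were predicted correctly; that interpretation, the function name, and the dictionary keys are assumptions for illustration.

```python
def evaluate(actual, predicted, positive="Y"):
    """Return overall accuracy plus recall for the fraud and non-fraud classes."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # fraud caught
    fn = sum(a == positive and p != positive for a, p in pairs)  # fraud missed
    tn = sum(a != positive and p != positive for a, p in pairs)  # genuine passed
    fp = sum(a != positive and p == positive for a, p in pairs)  # genuine flagged
    return {
        "accuracy": (tp + tn) / len(pairs),
        "fraud_recall": tp / (tp + fn) if tp + fn else 0.0,
        "nonfraud_recall": tn / (tn + fp) if tn + fp else 0.0,
    }
```

Reporting the two class-level figures alongside accuracy matters for skewed data, because a model that never flags fraud can still score a high overall accuracy.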
Performance Evaluation with new incoming data and subsequent execution with more historical data
New unlabeled data can be routed to the model to predict whether it could be fraudulent, as shown in Figure 7. The set of new data is analyzed by the trained algorithm, which assigns a prediction flag to each row of origin longitude and latitude and destination longitude and latitude. Suspected fraudulent locations are tagged as "Y" and non-fraudulent locations as "N". The identification is based primarily on how close the locations are to the labeled fraudulent cases provided in the learning stage. Thus, if new locations are present in the data, they will not be identified as fraud, as there is no historical data linked to them. To resolve this challenge, new location data can be identified with outlier techniques and analyzed separately before it is used as part of the main dataset. Alternatively, if the customer data profile is available, k-NN distance-based outlier techniques can be applied to look for outlier locations within a customer's shipment data set. If no new locations are expected from a particular customer and new locations are detected within that customer's data set, this data can be classified as potential fraud using outlier techniques. As such, combining k-NN machine learning with outlier techniques can be a complementary strategy to increase the effectiveness of fraud detection in the shipping domain.
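One common form of k-NN distance-based outlier detection scores each point by the distance to its k-th nearest neighbour and flags the points with the largest scores. The sketch below illustrates that idea for the locations of a single customer; the function names, the value of k, the threshold (expressed in degrees), and the sample coordinates in the usage note are hypothetical, and this is not presented as the exact technique the study would use.

```python
import math


def knn_outlier_scores(points, k=2):
    """Score each point by its distance to its k-th nearest neighbour.

    A larger score means the point is further from the rest of the data.
    Requires more than k points.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def flag_new_locations(points, k=2, threshold=5.0):
    """Return the indices of points whose k-NN distance exceeds the threshold."""
    return [i for i, s in enumerate(knn_outlier_scores(points, k)) if s > threshold]
```

For a customer whose shipments cluster around one city, a single destination on another continent receives a large score and is returned by `flag_new_locations`, so it can be reviewed separately before joining the main dataset.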

V. CONCLUSION
In conclusion, we recommend the k-NN machine learning algorithm to address fraud detection within the shipping domain. Fraud detection using machine learning is much more efficient than manually identifying fraud while a shipment is in progress. Identifying fraud before the shipment arrives at the destination is crucial to ensure that fraudulent items do not get delivered to the consignee. This is only possible by automating the fraud detection process, as some international shipments can be delivered within the same day depending on the location. Using the k-NN algorithm allows fraud to be detected within the duration of the shipment, which is an effective way to stop current fraud and to reduce future fraud cases. To overcome some of the challenges in identifying new cases of fraud, the k-NN machine learning technique can be combined with distance-based outlier techniques on data sets that are grouped by customer profile. Getting hold of actual shipping data from logistics companies was quite challenging due to the sensitive nature of shipping data. As such, the exploration of the various detection approaches, and the analysis of the strengths and weaknesses of each before choosing the most suitable approach, was done with simulated data based on parameters identified in a study done in a shipping company. Future papers may use the approaches and algorithms from this paper to simulate and perform further testing with actual data if they have access to it. Besides the location parameters used in this paper, other parameters influence fraud in the shipping domain, such as shipment weight, payment method, and the profile of the customer, which were not evaluated in this paper due to the lack of actual production data. Future papers can consider these parameters to obtain results with higher accuracy.