Household Overspending Model Amongst B40, M40 and T20 using Classification Algorithm

Abstract: The family economy is a critical indicator of the well-being of the family institution. It is reflected in total household income and in how well household finances are managed. In Malaysia, household income levels are categorized as B40, M40 and T20; these categories can also indicate the poverty level of a household. Overspending occurs when monthly expenses exceed the household's total income, and it affects economic wellbeing. Identifying the factors that shape household spending patterns can reveal the causes of overspending and assist the government in mitigating the problem. The availability of 4 million household expenditure records from a 2016 survey conducted by the Department of Statistics Malaysia supports the aim of this study, which is to develop a household overspending model using machine learning. The model is developed using 12 household demographic attributes and 14,451 household records. The attributes are number of households, area, state, strata, race, highest certificate, marital status, gender, housing, income, total expenditure, and category as the class attribute. Five machine learning algorithms are employed: decision tree, Naïve Bayes, neural network, Support Vector Machine and Nearest Neighbour. The results show that the decision tree built with the J48 algorithm produces the rules that are easiest to interpret. The model identifies four attributes that strongly influence overspending: income, state, race and number of households. Based on these findings, it can be concluded that these attributes are essential for improving the overspending-related indicators of the Malaysian Family Wellbeing Index.


I. INTRODUCTION
The Malaysian Family Wellbeing Index consists of eight indicators, one of which is the economy [1]. Economic wellbeing can be measured using two indicators: income and expenses. When a person spends more than his or her income, the phenomenon is called overspending. Overspending has been a persistent issue for many years [2] [3]; it results from a variety of factors and is growing every year [4]. Overspending may be caused by a family's basic needs. However, it may also stem from lifestyle, in particular an inability to assess and be critical of one's own spending. Undeniably, overspending can follow an unexpected event that occurs once in a blue moon, but it should not happen every month.
Overspending is expected to affect most millennials because of a lack of knowledge about how to spend money wisely. Facilities such as multiple bank accounts, insurance and savings plans make it easy for millennials to become trapped in overspending. Moreover, with high university fees, young people now carry large debts even before securing a job. Unlimited access to online shopping also contributes, and food delivery services increase food expenses. Bankruptcy is an even more serious consequence of overspending. Forty percent of households in the USA have faced overspending since 1990, and the number of non-business bankruptcies in the United States is reported to be increasing [5]. Malaysia also records a large and rising number of cases: in 2018, a total of 303,415 bankruptcies were reported [6].
Overspending needs to be addressed because the resulting financial constraints can lead to social problems such as theft, unauthorized money lending and long-term personal loans. Limited financial resources, growing needs and the rising cost of living are challenges that young people face in balancing their current and future needs. The variety of financial facilities available, especially credit cards, banking, investment, loans and e-wallets, requires consumers to have the financial knowledge to use them. Meeting a range of needs such as emergency care, children's education, credit and risk management (insurance/takaful), retirement planning and estate planning with limited resources is a challenge for today's financial management, and it adds to the cost of living as well.
A preliminary study of the 2016 consumer finance survey data shows that 8.44% of Malaysians fall into the overspending category. However, studies on overspending are still limited in Malaysia. Several approaches to smart financial management have been taught and used, but most focus on financial management rather than on spending style [7]. Various analyses have been conducted on the data; most researchers focus on the type of spending [7], and few studies address overspending, especially using artificial intelligence and data analytics methods to uncover meaningful knowledge. Research on overspending lifestyle has identified three main significant factors that influence spending: food, housing and transportation.
Thus, this study was conducted to develop a household overspending model for Malaysia's B40, M40 and T20 groups using classification techniques, and thereby to reveal the overspending patterns among these groups. The model was developed using only the demographic information provided by the Department of Statistics Malaysia. The contribution of this work is to highlight the serious factors behind overspending in a family context. The findings can also be used to identify possible indicators for the multidimensional index in Malaysia.
The rest of this paper is organized as follows. Section 2 presents related work on overspending data classification. Section 3 presents the materials and methods used for data classification. Section 4 reports the experimental results, and Section 5 concludes the findings.

II. RELATED WORK
Household income management is needed for a wise spending regime. Because of budget constraints, households need to plan and prioritize basic necessities, defined as daily needs that include food, housing, transportation, healthcare and clothing [8]. Low-income households were found to spend most of their income on basic necessities rather than on non-essential items. However, they still face overspending.
According to Rashid et al. (2018), across all income groups and strata there are three main types of household expenditure: first, food and non-alcoholic beverages; second, housing, water, electricity, gas and other fuels; and third, transportation. These three groups can be classified as basic necessities. The analysis shows that the B40 group spends almost two-thirds of its total expenditure on food and non-alcoholic beverages, housing, water, electricity, gas and other fuels, and transportation.
In 2019, [9] reported that Malaysian households spend 69.1% of their total consumption in four main groups: housing, water, electricity, gas and other fuels; food and non-alcoholic beverages; transportation; and restaurants and hotels. In the statistical report, the highest contributor to overall consumption was housing, water, electricity, gas and fuels (24.0%), followed by food and non-alcoholic beverages (18.0%), transportation (13.7%), and restaurants and hotels (13.4%). Rashid et al. (2018) used three regression approaches to analyse the relationship between total spending and basic needs among households in the three income groups. The results show that spending on basic needs has a significant relationship with total expenditure across income groups; the basic needs of food, transportation and housing each showed a significant relationship with total expenditure. In other words, increasing spending on basic needs increases total household spending. However, this research still relies on basic statistical analysis.
Artificial intelligence and data analytics have become popular approaches for discovering accurate and meaningful knowledge in various domains by making use of large amounts of past data. They offer various tasks such as classification, clustering, prediction, diagnosis and deviation detection [10]. The choice of method depends on the business problem and the type of data available. The aim is to identify the model that gives the highest classification accuracy. Besides accuracy, other measures such as mean absolute error, Root Mean Squared Error (RMSE), F-measure, precision, recall, the Kappa statistic, ROC and computation time are also considered in evaluating model performance. Classification is usually used for prediction or for discovering new knowledge in the form of rules, trees or functions [11] [12]. Various algorithms fall under classification, such as J48, Naïve Bayes, neural networks, Support Vector Machines and Nearest Neighbour [13]. J48 is the WEKA implementation of the C4.5 decision tree algorithm, which builds a decision tree from the data attributes. At each step the algorithm selects the attribute that discriminates the instances most clearly. The quality of the rules, tree or function produced can be judged by the accuracy of the model [14]. J48 has been reported to achieve higher accuracy than other algorithms. In analysing the poverty level in Indonesia, [15] used J48. Another study used the random forest algorithm to measure poverty in urban areas, providing more directional and timely decision-making assistance for resource allocation and renewal planning in poor communities [16].
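As a rough illustration of several of the evaluation measures listed above, the following sketch computes accuracy, precision, recall, F-measure and the Kappa statistic from the four cells of a binary confusion matrix. The function name and the counts are invented for illustration and are not taken from the study's data.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return common evaluation measures from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((fn + tn) / total) * ((fp + tn) / total)
    p_chance = p_yes + p_no
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "kappa": kappa,
    }


# Hypothetical counts, purely for illustration.
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=30)
```

Reporting kappa alongside accuracy is useful when one class dominates, as is the case for the overspending class in this kind of data.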
A neural network (NN) is a mathematical or computational model based on emulating a biological neural system. There are several neural network architectures, such as the artificial neural network (ANN) and the convolutional neural network (CNN). The output value of a neuron is usually a non-linear transformation of the sum of its stimuli. In more advanced models, the non-linear transformation is adapted by some continuous function. NNs are very popular for prediction with few attributes, for example stock market prediction [17], weather forecasting [18] and customer churn [19]. An NN was used by [20] to map poverty in Mexico, where a CNN was used to predict poverty maps for urban areas using imagery from Digital Globe or Planet.
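The statement that a neuron's output is a non-linear transformation of the sum of its stimuli can be sketched as below. The sigmoid is used here only as one common choice of non-linearity; the function name, weights and inputs are assumptions made for illustration.

```python
import math


def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs followed by a sigmoid non-linearity."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-weighted_sum))


# Illustrative values: the weighted sum is 0.0, so the sigmoid returns 0.5.
out = neuron_output([1.0, 0.5], [0.4, -0.8], 0.0)
```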
Another algorithm commonly used in such studies is the Bayesian approach. It involves statistical methods that assign probabilities or distributions to events or parameters based on experience or best guesses before experimentation and data collection. A Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be the "independent feature model" [21]. Bayesian methods have been shown to be accurate for various problems [22] [23]. Naïve Bayes was used by [24] to map potentially poor families in Indonesia in order to plan appropriate preventive measures for those families.
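For reference, the naive independence assumption can be written out as follows. The notation is introduced here and is not the paper's: $c$ denotes a class (for example, overspending or not), $x_1, \dots, x_n$ the attribute values of a household, and $\hat{y}$ the predicted class.

```latex
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
\hat{y} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

Each factor $P(x_i \mid c)$ is estimated separately from the training data, which is what makes the model an "independent feature model".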
Another popular classification technique is the Support Vector Machine (SVM), which can be employed for both classification and regression. SVM completes its analysis through a series of binary assessments on the data. SVM has performed well in various domains and is particularly popular in image processing.
Building on these techniques, this study develops a household overspending model for Malaysia's B40, M40 and T20 groups using classification techniques that reveal overspending patterns. The model is based on the demographic information provided by the Department of Statistics Malaysia (DOSM).

III. MATERIAL AND METHODS
The study follows the standard data mining process in four phases: (1) defining the business goal, (2) data collection, (3) data preprocessing and preparation, and (4) model development [25].

A. Defining Business Goal
In this study, the business goal is to identify the patterns of overspending among the household income classes (B40, M40 and T20). Factors that contribute to this goal are identified through feature selection from the source data set.

B. Data Collection
The survey was conducted in 2016 by the Department of Statistics Malaysia (DOSM), and a total of 4 million household expenditure records were obtained. However, only 20 percent of the data were used in this study because of constraints in obtaining the full data set from DOSM. The data consist of 12 selected demographic attributes: number of households, area, state, strata, race, highest certificate, marital status, gender, housing, income, total expenditure and category, which are described in detail in [26].

C. Preprocessing and Preparation
Data preparation involved attribute selection and class label determination. The phases involved in this study are discussed as follows.
The first phase was data preparation. As many attributes and records as possible were identified and collected. The data were then cleaned, which included replacing incomplete and incorrect values with null, and new attributes were generated for the poverty (income class) and overspending categories.
The second phase was descriptive analysis using SQL. The analysis covered the number, percentage and distribution of households in each state and the total spending for each category. It was then used to uncover basic knowledge about the overspending pattern among B40, M40 and T20.
The third phase was pre-processing, in which the data were discretised into nominal attributes. The best model was then determined, and the resulting knowledge was interpreted.

1) Income class level:
a) Generate income class:
The income class was generated by referring to the income thresholds set for each state by the Malaysian government [26]. The algorithm was translated into the rules in Table I.
The overspending category takes two values, 0 and 1: 0 means the household is not overspending, while 1 means its total expenditure exceeds its total income. The overspending class was prepared in Excel using the following formula:
IF (Revenue - Spend) < 0 THEN category overspending = 1 ELSE category overspending = 0
The 12 attributes in Table II were ranked using a classifier-based method to identify the influencing factors. As a result, the eight attributes that produced meaningful readings and ranked highest were category (0.19), sex (0.05), education (0.05), ethnic (0.03), marriage (0.02), number of a family (0.02), province (0.018) and state (0.018).
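The class label described in the text (1 when total expenditure exceeds total income, 0 otherwise) can be sketched in a few lines. The study prepared this label in Excel; the function name, field names and sample records below are hypothetical.

```python
def overspending_label(income, expenditure):
    """Return 1 if expenditure exceeds income (overspending), else 0."""
    return 1 if (income - expenditure) < 0 else 0


# Hypothetical household records, for illustration only.
households = [
    {"income": 3000.0, "expenditure": 3500.0},  # spends more than it earns
    {"income": 5000.0, "expenditure": 4200.0},  # spends less than it earns
    {"income": 2500.0, "expenditure": 2500.0},  # breaks even
]
labels = [overspending_label(h["income"], h["expenditure"]) for h in households]
```

A household that exactly breaks even is labelled 0 under this definition, since its expenditure does not exceed its income.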
2) Discretization: Discretization is a method that converts continuous data into categorical data [27]. In this study, data from three attributes were discretized using the equal-depth (equal-frequency) method. Table III describes the discretized attributes.
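A minimal sketch of equal-frequency (equal-depth) binning is shown below, assuming that each bin should receive roughly the same number of records. The function name, the sample incomes and the choice of three bins are illustrative and do not reproduce the bin boundaries of Table III.

```python
def equal_frequency_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 so bins hold roughly equal counts."""
    # Positions of the values in ascending order of value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        # The rank in sorted order decides the bin, not the value itself.
        bins[idx] = rank * n_bins // len(values)
    return bins


# Hypothetical monthly incomes (RM), for illustration only.
incomes = [1200, 8500, 3100, 4700, 2300, 15000, 5200, 3900, 6100]
income_bins = equal_frequency_bins(incomes, 3)
```

Because bins are assigned by rank, identical values may land in adjacent bins; a production version would usually derive cut points and apply them to the values instead.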

D. Model Development
The classification model was developed using 10-fold cross-validation in WEKA (Waikato Environment for Knowledge Analysis). Ten experiments were conducted with each of the five classification models: J48, Naïve Bayes, Artificial Neural Network, Support Vector Machine and k-Nearest Neighbour (Table IV). The experiment was also conducted using the other four classification models, of which J48 gave the closest accuracy, at 70%.
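The idea behind 10-fold cross-validation can be sketched without WEKA as follows. This is not the study's pipeline: a majority-class predictor stands in for the real classifiers (J48, SVM and so on), and the toy labels are invented.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train_idx, test_idx


def majority_class(labels):
    """Placeholder 'model': always predicts the most frequent training label."""
    return max(set(labels), key=labels.count)


def cross_validated_accuracy(labels, k=10):
    """Average accuracy of the placeholder model over k held-out folds."""
    correct = 0
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        prediction = majority_class([labels[j] for j in train_idx])
        correct += sum(1 for j in test_idx if labels[j] == prediction)
    return correct / len(labels)


# Toy, imbalanced labels: 90 non-overspending (0) and 10 overspending (1).
toy_labels = [0] * 90 + [1] * 10
cv_accuracy = cross_validated_accuracy(toy_labels, k=10)
```

On imbalanced data such as this, the trivial baseline already scores 0.9, which is one reason accuracy alone can overstate how well a model detects the minority overspending class.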

1) Distribution of income category based on state:
The distribution of income data was analysed by running SQL queries on the household, state and class ID variables and importing the results into Microsoft Excel. Fig. 1 shows the distribution of the income category by state.
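A query of the kind described here might look like the sketch below, which counts households per state and income class using Python's built-in sqlite3 module. The table name, column names and rows are hypothetical; the actual DOSM schema is not given in the text.

```python
import sqlite3

# In-memory database with a hypothetical schema and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE household (household_id INTEGER, state TEXT, income_class TEXT)"
)
rows = [
    (1, "Selangor", "M40"),
    (2, "Selangor", "B40"),
    (3, "Sabah", "B40"),
    (4, "Sabah", "B40"),
    (5, "Perlis", "T20"),
]
conn.executemany("INSERT INTO household VALUES (?, ?, ?)", rows)

# Count households for each (state, income class) pair.
distribution = conn.execute(
    "SELECT state, income_class, COUNT(*) FROM household "
    "GROUP BY state, income_class ORDER BY state, income_class"
).fetchall()
conn.close()
```

The resulting list of tuples can be exported to a spreadsheet to draw a chart such as the one in Fig. 1.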
The data show that the numbers of households from Sabah, Sarawak and Selangor are relatively high compared with other states. The Federal Territories of Putrajaya and Labuan, and the state of Perlis, show the least data. Table V shows the results of the statistical analysis of income, expenditure and overspending among the B40, M40 and T20 classes. The table also shows the percentage distribution of the households analysed in this study. Overall, 8.5% of households fall into overspending; 42% of the households belong to the B40, 39% to the M40 and only 10% to the T20. Out of a total monthly income of RM 91 million, RM 54 million was spent each month. The study found 1,027 overspending households in the B40, 166 in the M40 and 35 in the T20, or about 17%, 2.9% and 1.3% of each class, respectively. The analysis also shows that some B40 households spend more than M40 and T20 households. The minimum monthly spending among overspending households is 70 cents, notably for students. It can be concluded that 83% of B40 households manage their money well, and that only a very small percentage of M40 and T20 households fall into overspending. However, even these percentages of overspending households can cause social problems such as theft, bribery or bankruptcy.
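The overall overspending rate can be checked from the reported counts. The sketch below assumes that the 166 households belong to the M40 class and that the total of 14,451 records stated in the abstract is the denominator; the variable names are illustrative.

```python
# Overspending household counts as reported in the text
# (the 166 figure is taken to be M40).
overspending_counts = {"B40": 1027, "M40": 166, "T20": 35}
total_households = 14451  # number of records stated in the abstract

total_overspending = sum(overspending_counts.values())
overall_rate = 100 * total_overspending / total_households  # in percent
```

Under these assumptions, 1,228 of 14,451 households are overspending, which rounds to the 8.5% reported above.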

2) Statistical analysis of income, spending and overspending:
The table shows that the maximum overspending amount for B40 is RM 5,868. It can be concluded that an amount of about RM 6,000 per household is a significant threshold for keeping households from falling into overspending.

3) Income range and overspending by income class:
Further analysis of overspending is shown in Table VI. The table shows the B40, M40 and T20 households that fall into the overspending category by income range and spending range. The largest group of overspending households has an income below RM 2,768; 75% of overspending B40 households fall into this category, while 22% of B40 and 40% of M40 overspending households have incomes above RM 2,768.10 and below RM 5,197. Together, these ranges show that 97% of overspending B40 households have incomes below RM 5,197.70.
The data also show that 39% of overspending M40 households have incomes below RM 5,197.70. This indicates that the basic cost of living in Malaysia per household should be less than RM 5,500, with RM 6,000 being more comfortable. However, these findings are not conclusive, because a more in-depth study of spending patterns among Malaysians is needed. The analysis does not describe the types of expenditure of the overspending group, for example whether they fall into a leisure lifestyle.
The data also show that only 3% of overspending B40 households have incomes above RM 5,197.70, which we define as improper financial management. Seven households fall into this category; most are families with 2 to 5 members per household, incomes between RM 3,000 and RM 7,000 and SPM or diploma education, comprising 2 female and 5 male, from Johor, Kedah and Selangor. Similarly, about 60% of overspending M40 households have incomes above RM 5,197.70. Most overspending M40 households are from Kedah, Sarawak and Perak, their education level is mostly SPM/STPM, and they are Bumiputera and Chinese. Most overspending T20 households are Bumiputera and Indian, with an education level of SPM, certificate or diploma, and they are scattered across Malaysia.

B. Classification Experiment Result
As stated earlier, the aim of mining the data is to discover knowledge about overspending patterns using a classification approach. Table VII summarises the classification accuracy of the five classification models, with bold values marking the best result for each model. The accuracy shown is that of the best model obtained across the training versus testing folds. SVM proved to be the best model for household overspending among B40, M40 and T20, with an accuracy of 89.17%, followed by J48, ANN, kNN and Naïve Bayes, for which accuracies of 88.84%, 86.97% and 84.77% are reported. Although SVM shows the best accuracy, the J48 classification model was selected for rule generation because J48 can present the model in the form of rules, which makes knowledge discovery easier.
Tables VIII to X show the overspending rules generated from six types of attributes for the B40, M40 and T20 groups, respectively. The rules were extracted from the decision tree built by the J48 algorithm, which presents the attributes in order of importance.
From the rules generated, overspending occurs among B40 households living in Melaka, Perlis, Perak and Sarawak. For B40 households living in Perlis, the Federal Territories, Labuan, Penang, Johor and Selangor, overspending occurs if expenditure is below RM 3952.62. Overspending also occurs among B40 households if total expenditure is between RM 3952.62 and RM 6557.03 for residents of Johor with four or fewer children. In Terengganu, B40 households are considered to overspend with a total expenditure of less than RM 4800.59. In Melaka, B40 households are considered overspent when their spending reaches RM 5367.57. Lastly, overspending occurs for B40 households living in rural areas of Kedah.
In Perak, Sarawak, Perlis, Labuan and Selangor, M40 households that spend more than RM 6557.03 are considered overspent. In Melaka, M40 households overspend when total expenditure reaches RM 5367.67. In Labuan, M40 households overspend when total expenditure reaches RM 10350.4. M40 households living in Kedah with total expenditure between RM 7297 and RM 10350.4 are also considered to be in the overspending category. In Kuala Lumpur and Terengganu, M40 households overspend when their total expenditure exceeds RM 4800, especially those living in the city. These rules can be seen in Table IX.
Table X shows the overspending rules for the T20 group. T20 households are considered overspent when their total spending exceeds RM 6557.03; the overspending T20 households are among those living in Negeri Sembilan and Pahang. In Melaka, T20 households are considered overspent when total expenditure exceeds RM 9875.87. Lastly, in Labuan, T20 households are overspent when total expenditure exceeds RM 10350.4.
This study has presented descriptive analysis results and an analytical analysis using the J48 classification method to produce demographic-based overspending rules and rules based on expenditure type. The descriptive analysis showed the distribution statistics of the B40, M40 and T20 groups by state. Comparative analysis of the two variables can be performed individually to compare or produce specific patterns. Similarly, a descriptive analysis of overspending by expenditure type shows the average distribution of expenditure types for B40, M40 and T20.
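The T20 rules described for Table X can be read as simple per-state threshold checks, as in the sketch below. This is a simplified reading of the prose rather than the actual J48 tree: pairing the RM 6557.03 threshold with Negeri Sembilan and Pahang is an interpretation, and the names `T20_THRESHOLDS` and `t20_rule_fires` are hypothetical.

```python
# Simplified reading of the T20 rules in the text:
# state -> total expenditure threshold in RM (an interpretation, not Table X itself).
T20_THRESHOLDS = {
    "Negeri Sembilan": 6557.03,
    "Pahang": 6557.03,
    "Melaka": 9875.87,
    "Labuan": 10350.4,
}


def t20_rule_fires(state, total_expenditure):
    """Return True if a T20 household matches an overspending rule for its state."""
    threshold = T20_THRESHOLDS.get(state)
    if threshold is None:
        return False  # no T20 rule is described for this state in the text
    return total_expenditure > threshold
```

For example, `t20_rule_fires("Melaka", 10000.0)` is true because the expenditure exceeds the Melaka threshold, whereas the same call with 9000.0 is false.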
Six attributes most influence overspending: state, race, income, strata, number of households and category. The rules give the specific overspending conditions for each state. They show that the distinguishing feature of overspending is whether total expenditure is below or above RM 6557 in each state, followed by other features such as race, strata and number of households. For example, in Kuala Lumpur, M40 households with more than four children face overspending when costs exceed RM 6557. With this model, assessment of the B40 category need not depend only on income but can also take the number of household members into account. The model also indicates that the number of household members can be one of the variables for identifying the poverty category as B40, M40 or T20: the higher the number of household members, the higher the total expenses. Moreover, ethnicity is also an overspending factor, although it applies only in Sabah.
This paper has shown how data analytics helps in identifying the attributes that influence the poverty category based on demography. However, the rules were produced from B40, M40 and T20 data covering only 20 percent of the questionnaires conducted by DOSM. The findings may therefore be inaccurate, because the limited data may not reflect the actual occupations and population of Malaysians. This study shows how data science, and data mining in particular, can extract more detailed knowledge from existing data sources. The descriptive results and the analysis of the overspending models for B40, M40 and T20 give a clear picture of how analytics can provide more detailed knowledge to assist the authorities in making decisions or planning strategies for income and expenditure management in Malaysia.
At the end of this study, income, state and the number of children or members per household were found to be among the most influential factors in determining the Malaysian Family Wellbeing Index. However, this paper focuses on demographic data and therefore could not identify lifestyle. Further research on an overspending model based on types of expenses, such as total food and transportation spending, would be more useful for identifying lifestyle. A study of the expense types with the highest amounts can therefore be done in the future. Such findings could encourage these groups to watch their spending in those areas so that they do not fall into the overspending category.