Modeling for Car Quality Complaint Classification based on Machine Learning

Abstract: Cars play an important role in many aspects of social life, and handling car quality complaints effectively matters both for keeping cars running properly and for maintaining brand reputation. Effective classification of car quality complaint texts is the basis for handling those complaints efficiently, but manual classification suffers from heavy workload, dependence on experience, and proneness to error. Machine learning methods have been widely used to build automatic classification models for many types of natural language text, so constructing a machine-learning-based automatic classification model for car quality complaints has practical significance. Based on the characteristics of car quality complaint texts, this study vectorized the texts after word segmentation, performed feature selection and dimension reduction based on correlation analysis, and combined a progressive model training method with a support vector machine to construct the classification model. Model reliability was evaluated through the effect of data amount on modeling and the effect of text length on the prediction probability distribution. The results show that the method in this study can produce an effective automatic classification model for car quality complaint texts.


I. INTRODUCTION
Studies on text classification are extensive, but few address complaint texts, and the applicability of a classification method is closely tied to text characteristics. Cars are complex in composition and structure, and quality problems may gradually appear over long-term use. Handling these problems reasonably has an important effect on the normal operation of cars and on user experience, and it is also an important factor when people choose a car brand or product.
Machine learning has been extensively applied to natural language text classification in recent years [1][2][3][4]. Text classification based on machine learning mainly involves two core steps: text vectorization and classification modeling. Text vectorization methods mainly include those based on word frequency [5][6][7], those based on distributed static word vectors [8][9][10][11], and those based on distributed dynamic word vectors [12][13]. Classification modeling methods mainly include classical machine learning methods [14], various neural networks [15][16][17][18][19], and ensemble learning [20].

II. TECHNICAL ROUTE
The technical route of this study has seven parts: data sorting, data characteristic analysis, word segmentation, feature extraction, classification modeling, model reliability analysis, and summary and prospect.
The data sorting part covers the acquisition of basic data and the construction of the research dataset according to the text characteristics and research purposes. The data characteristic analysis part gives an overview of the dataset in terms of data distribution, text length characteristics, and the distributions of car type, purchase time, and car brand.
The research object of this study is Chinese text, so the study involves word segmentation. The word segmentation part of the technical route includes word segmentation with Jieba, removal of stop words, word frequency distribution analysis, and classification feature word analysis. Stop word removal mainly targets function words that carry little significance for classification, such as the connectives in complaint texts. Word frequency distribution analysis examines how well high-frequency words can discriminate between categories of car quality complaints, by comparing the frequency distribution of global high-frequency words across categories. Classification feature word analysis examines the high-frequency words of each category after stop words are removed and gives a preliminary picture of the data.
The feature extraction part mainly involves three steps: text vectorization, feature correlation analysis, and feature selection.
In the text vectorization step, the text data is converted to vector form with the bag-of-words method, excluding stop words. In the feature correlation analysis step, the frequency correlation of word features is analyzed by constructing a correlation matrix, and feature selection is performed by removing highly correlated word features, which reduces the vector dimension and improves the efficiency of modeling and classification.
www.ijacsa.thesai.org
In the classification modeling part, a progressive strategy is used: the proportion of features used in modeling is increased gradually over multiple stages, and the modeling effects under different feature proportions are compared to obtain the optimal number of modeling features. The rationale is that too few features might not contain enough information to build an effective classification model, while too many features might obscure the core information and reduce the classification ability of the model. Too many modeling features would also reduce the efficiency of modeling and classification. The classification model is a support vector machine, and the modeling effect is evaluated from two aspects: the classification quality on the whole dataset and the quality on each category of complaint texts.
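The progressive strategy described above can be sketched as a loop over feature ratios. This is a minimal illustration rather than the procedure used in the study: the function name `progressive_feature_search`, the ordering of `ranked_features`, and the `evaluate` callback (which would train the SVM on the given features and return a score) are all assumptions introduced here.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def progressive_feature_search(
    ranked_features: Sequence[str],
    ratios: Sequence[float],
    evaluate: Callable[[List[str]], float],
) -> Tuple[float, int, Dict[float, float]]:
    """Evaluate growing prefixes of a ranked feature list, stage by stage.

    `evaluate` is a hypothetical callback: it receives the features used at
    one stage and returns a quality score (for example an F value).
    Returns (best ratio, number of features at that ratio, all scores).
    """
    scores: Dict[float, float] = {}
    best_ratio, best_count, best_score = 0.0, 0, float("-inf")
    for ratio in ratios:
        # Number of features used at this stage (always at least one).
        count = max(1, round(ratio * len(ranked_features)))
        score = evaluate(list(ranked_features[:count]))
        scores[ratio] = score
        # Strict ">" keeps the earlier (smaller) feature set when scores tie.
        if score > best_score:
            best_ratio, best_count, best_score = ratio, count, score
    return best_ratio, best_count, scores


# Toy scorer that peaks at six features, standing in for real SVM training.
features = [f"w{i}" for i in range(10)]
best_ratio, best_count, scores = progressive_feature_search(
    features, [0.2, 0.4, 0.6, 0.8, 1.0], lambda used: -abs(len(used) - 6)
)
print(best_ratio, best_count)
```

Keeping the smaller feature set on ties reflects the efficiency argument in the text: when two stages perform equally, fewer features cost less to train and predict with.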
The model reliability analysis part covers three aspects: the effect of data amount on the overall modeling indexes, the effect of data amount on the classification effect in each category, and the effect of text length on the probability distribution of the classification prediction. The amount of training data commonly has an important effect on model reliability. Too little data might not be enough to train a reliable and stable model, and the model's ability to predict new data outside the research dataset might be insufficient or unstable. Generally, more data yields a more stable model, but once the amount of data reaches a certain threshold, further increases commonly no longer have a significant effect on model stability. The amount of data required for training a classification model commonly varies with the type of text. This study analyzes the effect of data amount on the classification modeling of car quality complaint texts by adding data incrementally and comparing multiple rounds of model training. The evaluation covers two aspects: the effect of data amount on the overall modeling indexes, and the effect of data amount on the classification effect in each category. In addition, text length might affect classification prediction, and the discrimination of the prediction probability distribution reflects, to some extent, the reliability of the prediction. Therefore, this study treats the effect of text length on the probability distribution of classification prediction as another aspect of the reliability evaluation.
At the end of the study, the results are summarized into conclusions and the deficiencies of the study are analyzed, so as to provide a reference for subsequent related research and application.
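One way to quantify the discrimination of a prediction probability distribution is the gap between the two largest class probabilities. The text does not state which measure was used, so the top-2 margin below, along with the function names and the sample probabilities, is an assumption made only for illustration.

```python
from typing import Sequence


def top2_margin(probabilities: Sequence[float]) -> float:
    """Gap between the largest and second-largest class probability.

    A large gap means the model clearly prefers one category; a gap near
    zero means the prediction is ambiguous. (Assumed measure, not taken
    from the study.)
    """
    if len(probabilities) < 2:
        raise ValueError("need probabilities for at least two classes")
    ordered = sorted(probabilities, reverse=True)
    return ordered[0] - ordered[1]


def mean_margin(rows: Sequence[Sequence[float]]) -> float:
    """Average top-2 margin over a group of texts."""
    if not rows:
        raise ValueError("need at least one probability row")
    return sum(top2_margin(row) for row in rows) / len(rows)


# Made-up probability rows for a group of long and a group of short texts.
long_texts = [[0.90, 0.05, 0.05], [0.80, 0.10, 0.10]]
short_texts = [[0.50, 0.30, 0.20], [0.40, 0.35, 0.25]]
print(mean_margin(long_texts), mean_margin(short_texts))
```

Comparing the mean margin of the longest texts with that of the shortest texts gives a single number per group, which makes the effect of text length on prediction confidence easy to read.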
The technical route of this study is shown in Fig. 1.
The research data of this study comes from the Beijing Car Quality Net Information Technology Limited Company. The dataset includes 8 categories of car quality complaint texts: engine/electric motor; transmission; clutch; steering system; braking system; tires; front and rear axles and suspension system; and car body accessories and electrical appliances. The dataset contains 2400 texts in total, 300 for each category. The data amount and text length characteristics are shown in Table Ⅰ.
The car quality complaint texts carry attribute labels such as car type, purchase time, and car brand, and differences in these attributes might influence classification model training and text classification prediction. The distribution of the dataset in terms of car type, purchase time, and car brand is shown in Fig. 2.
The research data of this study is Chinese text, so the texts must be divided into words. This study uses Jieba, a word segmentation tool widely used for Chinese, to separate the words. The word segmentation results are summarized by category, number of characters, number of separated words, number of unique words, and repetition rate, and are shown in Table Ⅱ.
Fig. 3 depicts the word frequency distribution of the global high-frequency words in different categories. How differently a global high-frequency word is distributed across categories is an important indication of how much that word can contribute to discriminating between categories. If the frequency distributions of global high-frequency words differ widely and significantly across categories, a method based on word frequency might be well suited to the corresponding text classification task.
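The comparison behind Fig. 3 can be sketched as follows: find the most frequent words over the whole dataset, then compute what share of each word's occurrences falls in each category. The sketch assumes the texts are already segmented into tokens (for example by Jieba); the function names and the English placeholder tokens are illustrative only.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Doc = Tuple[List[str], str]  # (segmented tokens, category label)


def global_top_words(docs: Sequence[Doc], k: int) -> List[str]:
    """The k most frequent tokens over all documents."""
    counts = Counter(tok for tokens, _ in docs for tok in tokens)
    return [word for word, _ in counts.most_common(k)]


def category_distribution(
    docs: Sequence[Doc], words: Sequence[str]
) -> Dict[str, Dict[str, float]]:
    """For each word, the share of its occurrences found in each category."""
    per_word: Dict[str, Counter] = {w: Counter() for w in words}
    for tokens, category in docs:
        for tok in tokens:
            if tok in per_word:
                per_word[tok][category] += 1
    result: Dict[str, Dict[str, float]] = {}
    for word, cat_counts in per_word.items():
        total = sum(cat_counts.values())
        result[word] = (
            {cat: n / total for cat, n in cat_counts.items()} if total else {}
        )
    return result


# Placeholder tokens standing in for segmented Chinese complaint texts.
docs = [
    (["brake", "noise", "brake"], "braking system"),
    (["tire", "bulge", "noise"], "tires"),
    (["brake", "pedal"], "braking system"),
]
top = global_top_words(docs, 2)
print(category_distribution(docs, top))
```

A word concentrated in one category (here "brake") is a strong discriminator, while a word spread evenly across categories (here "noise") contributes little.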
After the word segmentation and the removal of stop words, the top 20 high-frequency feature words of each category are shown in Table Ⅲ.
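A table of this kind can be produced with a per-category frequency count that skips stop words. The following is a minimal sketch assuming the texts are already segmented; `top_words_per_category`, the stop word set, and the placeholder tokens are hypothetical names and data, not taken from the study.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Set, Tuple

Doc = Tuple[List[str], str]  # (segmented tokens, category label)


def top_words_per_category(
    docs: Sequence[Doc], stop_words: Set[str], k: int = 20
) -> Dict[str, List[str]]:
    """Top-k most frequent non-stop-word tokens for each category."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for tokens, category in docs:
        # Drop stop words before counting, as in the word segmentation step.
        counts[category].update(t for t in tokens if t not in stop_words)
    return {
        category: [word for word, _ in counter.most_common(k)]
        for category, counter in counts.items()
    }


# Placeholder tokens standing in for segmented Chinese complaint texts.
docs = [
    (["and", "brake", "brake", "pedal"], "braking system"),
    (["but", "tire", "tire", "bulge", "and"], "tires"),
]
print(top_words_per_category(docs, stop_words={"and", "but"}, k=1))
```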

V. FEATURE EXTRACTION AND CLASSIFICATION MODELING
This study uses the bag-of-words method, which is based on word frequency, for text vectorization. The correlation between features is analyzed with a correlation matrix, and feature selection based on that correlation reduces the dimension of the text vectors and improves the efficiency of classification model training and text classification. The correlation heatmap of the global high-frequency words after removing stop words is shown in Fig. 4. Due to space limitations, Fig. 4 shows only the correlations among the top 15 words by global word frequency.
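The vectorization and correlation-based selection steps can be sketched in plain Python. The sketch assumes Pearson correlation between word count columns and a greedy rule that keeps a word only if it is not highly correlated with any word already kept. The correlation measure, the 0.9 threshold, the greedy order, and all function names are assumptions, since the text does not specify them.

```python
import math
from typing import List, Sequence


def bag_of_words(
    docs: Sequence[Sequence[str]], vocabulary: Sequence[str]
) -> List[List[int]]:
    """Count matrix: one row per document, one column per vocabulary word."""
    index = {word: j for j, word in enumerate(vocabulary)}
    matrix = []
    for tokens in docs:
        row = [0] * len(vocabulary)
        for tok in tokens:
            j = index.get(tok)
            if j is not None:
                row[j] += 1
        matrix.append(row)
    return matrix


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; constant columns are treated as uncorrelated."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def prune_correlated(
    matrix: Sequence[Sequence[int]],
    vocabulary: Sequence[str],
    threshold: float = 0.9,  # assumed cut-off, not given in the text
) -> List[str]:
    """Greedily keep words whose |correlation| with every kept word is low."""
    columns = [list(col) for col in zip(*matrix)]  # needs a non-empty matrix
    kept: List[int] = []
    for j in range(len(vocabulary)):
        if all(abs(pearson(columns[j], columns[i])) < threshold for i in kept):
            kept.append(j)
    return [vocabulary[j] for j in kept]


# Placeholder tokens: "brake" and "pedal" always occur together.
docs = [["brake", "pedal", "noise"], ["brake", "pedal"], ["noise"], ["tire"]]
vocab = ["brake", "pedal", "noise", "tire"]
print(prune_correlated(bag_of_words(docs, vocab), vocab))
```

In the toy data, "pedal" duplicates the information carried by "brake" and is dropped, which shortens every text vector by one dimension without losing a distinct signal.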
This study uses a progressive feature selection strategy that increases the feature usage ratio step by step. The classification model is trained with the SVM method, and the modeling results are evaluated in terms of overall accuracy, overall recall, overall F value, and the F value of each category. The progressive strategy helps to find a reasonable threshold for the number of model features. If too few features are used, the modeling effect might suffer from insufficient information; if too many are used, model quality, training efficiency, and prediction efficiency might suffer because non-core information is introduced. The training results of the classification model are shown in Table Ⅳ.
Model reliability analysis is of great significance to the evaluation of model quality. This study analyzes the reliability of the model from two aspects: the effect of data amount on the classification modeling and the effect of text length on the classification prediction. The amount of training data commonly has a direct effect on the reliability and stability of a text classification model. Too little data might limit the applicability of the trained model and make its predictions on new data unstable. Once the amount of training data reaches a certain value, the effect of additional data on the training result is commonly no longer significant.
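The evaluation indexes named for Table Ⅳ (overall accuracy, overall recall, overall F value, and the F value of each category) can be sketched as below. The text does not say how the overall recall and F value are averaged, so unweighted macro averaging over categories is an assumption here, as are the function names and the sample labels.

```python
from typing import Dict, Sequence


def per_category_scores(
    y_true: Sequence[str], y_pred: Sequence[str]
) -> Dict[str, Dict[str, float]]:
    """Precision, recall and F1 for every category label."""
    scores: Dict[str, Dict[str, float]] = {}
    pairs = list(zip(y_true, y_pred))
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def overall_scores(
    y_true: Sequence[str], y_pred: Sequence[str]
) -> Dict[str, float]:
    """Accuracy plus macro-averaged recall and F1 (assumed averaging)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label sequences must be non-empty and equal length")
    per_cat = per_category_scores(y_true, y_pred)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    macro_recall = sum(s["recall"] for s in per_cat.values()) / len(per_cat)
    macro_f1 = sum(s["f1"] for s in per_cat.values()) / len(per_cat)
    return {
        "accuracy": accuracy,
        "macro_recall": macro_recall,
        "macro_f1": macro_f1,
    }


# Made-up labels for four test texts.
y_true = ["tires", "tires", "clutch", "clutch"]
y_pred = ["tires", "clutch", "clutch", "clutch"]
print(overall_scores(y_true, y_pred))
print(per_category_scores(y_true, y_pred)["clutch"]["f1"])
```

When every category has the same number of test texts, macro-averaged recall equals accuracy, so identical values for the two indexes are expected under that condition.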
Using incrementally larger data subsets, this study compared multiple rounds of text classification model training; the resulting indexes are shown in Table Ⅴ.
Text length is an important factor in text classification model training and prediction, and the difference in the prediction probability distribution between texts of different lengths is another useful measure of the reliability of the classification model. In this study, the first 8 and the last 8 texts in the global ranking by text length are selected to analyze the probability distribution of the classification prediction; the results are shown in Table Ⅵ.
This study focuses on the automatic classification of car quality complaints. The research content mainly includes data characteristic analysis, word segmentation, feature extraction, classification modeling, and model reliability analysis. The results show that an effective classification model for car quality complaint texts can be obtained by combining text vectorization based on word frequency, feature selection and dimensionality reduction based on correlation analysis, and progressive SVM model training with an increasing number of features. The best modeling effect is obtained when 938 features are used for model training: among the global indexes, the accuracy, recall, and f1-score reach 0.9375, 0.9375, and 0.9384 respectively, and the per-category f1-score ranges from a lowest value of 0.8657 to a highest value of 1.0000. In the model reliability evaluation based on incremental data amount, the training effect is almost stable once the data usage ratio reaches 75%. The analysis of the prediction probability distribution on the globally longest and shortest texts shows that the classification probability values obtained from the model are, overall, highly discriminative.
In general, the method of this study enables effective modeling for the automatic classification of car quality complaint texts. The work remains theoretical and has not yet been applied in practice; it is expected to provide a useful reference for subsequent research and practical application.