An Efficient Audio Classification Approach Based on Support Vector Machines

To classify audio recordings by composer, the use of adequate and relevant features is essential for good performance, especially when the classification algorithm is based on support vector machines. Conventional approaches often use timbral features computed from a time-frequency representation of the musical signal with a constant window. In contrast, this paper presents a new audio classification method that improves feature extraction by following the Constant Q Transform (CQT) approach and introduces original audio features related to the musical context in which the notes appear. This work also proposes an optimal feature selection procedure that combines filter and wrapper strategies. Experimental results show the accuracy and efficiency of the adopted approach in both binary and multi-class classification.
Keywords: Classification; features; selection; timbre; SVM; IRMFSP; RFE-SVM; CQT


I. INTRODUCTION
Music Information Retrieval (MIR) is a growing field that benefits from developments in signal processing and communication media tools. It uses pattern recognition techniques to solve problems of digital music transcription, including classification. Here, the classification aims to identify the artist of a given musical track.
To improve performance, much research focuses on the choice of an efficient classification algorithm and on the use of relevant features able to answer the query at hand. The conventional solution is to choose a set of features that reflect the musical timbre and to reduce their dimensionality using filter, wrapper or embedded strategies [1][2][3]. The purpose of this work is, first, to introduce and exploit a new family of so-called transition features, which characterize the musical context in which the notes appear, and to optimize the classification algorithm. Second, we improve performance by combining feature selection approaches.
The rest of this paper is organized as follows. Section 2 reviews the state of the art in automatic audio classification algorithms and feature selection. The classifier used is presented in Section 3. Section 4 is devoted to the proposed approach, which consists of extracting features, and describes in detail the dimensionality reduction technique. The classifier parameters and the practical implementation results are presented in Section 5. Finally, Section 6 concludes the paper and suggests future related work.

II. STATE OF THE ART
To achieve an audio recognition task, different classification algorithms have been designed and tested [4][5][6][7][8]. Among these approaches, a consensus seems to have developed around the use of Support Vector Machines (SVM) [9][10][11][7][12][13] because of their flexibility, computational efficiency, capacity to handle high-dimensional data and the benefit they draw from feature selection [11].
Feature selection has been a central topic in a variety of fields such as pattern recognition and machine learning. The objective is to find an optimal subset of relevant and non-redundant features in order to guarantee classification accuracy, computational efficiency and learning convergence [13][14][15]. In the filter strategy, feature selection is performed independently of the learning classifier, and the Inertia Ratio Maximization using Feature Space Projection (IRMFSP) algorithm [6][13] is the most popular due to its simplicity, speed and efficiency. For the wrapper approach, the well-known feature selection method combined with Support Vector Machines is the Recursive Feature Elimination (RFE-SVM) algorithm [14]. It generates a ranking of features by backward elimination until the highest classification accuracy is obtained. However, RFE-SVM is a greedy method that only hopes to find the best possible combination for classification [16] and may be biased when the features are highly correlated [17].

III. SVM CLASSIFIER
In this section, we first introduce the basic theory of the SVM for the two-class classification problem, then the strategy used to solve the multiclass problem, and finally the feature selection method that is combined with Support Vector Machines.

A. Basic Theory of SVMs
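Consider a training set of instance-label pairs $(x_i, y_i)$, $i = 1, \dots, l$, where $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, +1\}$. Support vector machines (SVMs) [18] find a hyperplane that separates the data with the maximum margin. This requires the solution of the following optimization problem:

$$\min_{w, b} \ \frac{1}{2} \|w\|^2 \quad (1)$$

subject to

$$y_i \left( w^T \phi(x_i) + b \right) \ge 1, \quad i = 1, \dots, l \quad (2)$$

Using a soft margin instead of a hard margin, the primal problem for SVMs is obtained:

$$\min_{w, b, \xi} \ \frac{1}{2} \|w\|^2 + C \, F(\xi) \quad \text{subject to} \quad y_i \left( w^T \phi(x_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0 \quad (3)$$

where:
- $\{\xi_i\}$ are slack variables which allow for penalized constraint violation through the penalty function $F(\xi)$, defined by (4):

$$F(\xi) = \sum_{i=1}^{l} \xi_i \quad (4)$$

- $C$ is the parameter controlling the trade-off between a large margin and fewer constraint violations.
- $\phi(\cdot)$ represents the mapping from the input space to the feature space.

However, researchers prefer to use a kernel function $K(\cdot, \cdot)$ given by the following expression:

$$K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$$

In practice, the most commonly used kernel functions are:
- Linear: $K(x_i, x_j) = x_i^T x_j$
- Radial basis function (RBF): $K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right)$, $\gamma > 0$ (7)

Here, $\gamma$ and $d$ are kernel parameters. A practical use and implementation of the SVM classifier is presented in [19].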
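As a minimal sketch of how such kernels enter the SVM decision rule (not the implementation used in this work), the linear and RBF kernels and a kernelized decision function can be written as follows. The RBF is assumed to take the common form $\exp(-\gamma \|x - z\|^2)$; the function names, the value of `gamma` and the toy dual coefficients are illustrative assumptions.

```python
import math

def linear_kernel(x, z):
    """Linear kernel: K(x, z) = x^T z."""
    return sum(a * b for a, b in zip(x, z))

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel in its common form: K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def decision_value(x, support_vectors, labels, alphas, b, kernel):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b; the class is the sign of f(x)."""
    return sum(a * y * kernel(sv, x)
               for sv, y, a in zip(support_vectors, labels, alphas)) + b

# Toy model with hand-picked (hypothetical) support vectors and coefficients.
svs = [(1.0, 1.0), (-1.0, -1.0)]
ys = [+1, -1]
alphas = [0.25, 0.25]

score = decision_value((2.0, 2.0), svs, ys, alphas, 0.0, linear_kernel)
predicted = +1 if score >= 0 else -1
```

Swapping `linear_kernel` for `rbf_kernel` changes the shape of the decision surface without changing the decision rule itself, which is the practical appeal of the kernel formulation.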

B. Multiclass Classification
As investigated in [20], multiclass classification based on SVMs is commonly performed by one of two methods, "one-vs-one" or "one-vs-all". Both consider the multi-class problem as a collection of two-class classification problems. For k-class classification, the "one-vs-all" method constructs k classifiers, where each classifier constructs a hyperplane between one class and the remaining $(k-1)$ classes. A majority vote or some other measure is then applied over the classifier outputs for the decision. For the "one-vs-one" approach, $k(k-1)/2$ classifications are carried out, one for each possible pair of classes, and a voting scheme is similarly applied for the decision.
For more speed and reliability, the Directed Acyclic Graph SVM (DAGSVM) [20][21] is often adopted. In this approach, the testing phase uses a rooted binary directed acyclic graph which has $k(k-1)/2$ internal nodes and k leaves. Each node is a binary SVM for the i-th and j-th classes.
Thanks to this hierarchical classification based on an acyclic graph structure, the DAG paradigm allows both:
- Reducing a multiclass classification to a set of two-class classifications.
- Classifying among k classes with $(k-1)$ comparisons instead of the $k(k-1)/2$ required by the basic "one-vs-one" approach.
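The saving in comparisons can be illustrated with a short sketch of the DAG evaluation. The pairwise classifiers below are hypothetical stand-ins (nearest of two 1-D class centers) rather than trained SVMs, and `dag_predict`, `centers` and `make_pair_classifier` are names introduced only for this illustration.

```python
from itertools import combinations

def dag_predict(x, classes, pairwise):
    """Evaluate a DAG of binary classifiers: each node compares the first and
    last remaining candidate classes and eliminates the loser."""
    candidates = list(classes)
    comparisons = 0
    while len(candidates) > 1:
        i, j = candidates[0], candidates[-1]
        winner = pairwise[(i, j)](x)   # binary decision between classes i and j
        comparisons += 1
        if winner == i:
            candidates.pop()           # class j is eliminated
        else:
            candidates.pop(0)          # class i is eliminated
    return candidates[0], comparisons

# Hypothetical 1-D class centers standing in for trained binary SVMs.
centers = {0: 0.0, 1: 1.0, 2: 2.0, 3: 3.0}

def make_pair_classifier(i, j):
    return lambda x: i if abs(x - centers[i]) <= abs(x - centers[j]) else j

classes = sorted(centers)
pairwise = {(i, j): make_pair_classifier(i, j) for i, j in combinations(classes, 2)}

label, n_comparisons = dag_predict(2.2, classes, pairwise)
```

With k = 4 classes, six pairwise classifiers are trained, but a single prediction evaluates only three of them.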

C. Recursive Feature Elimination-SVM
The well-studied RFE-SVM is a feature selection algorithm for supervised classification that belongs to the wrapper methods. It integrates filtering into the SVM learning process, not only to evaluate each subset using the SVM classifier but also to obtain information on the contribution of each feature to the construction of the separating hyperplane. RFE-SVM ranks all the features according to some score function and recursively eliminates one or more features with the lowest score.
According to [14], the RFE-SVM algorithm can be decomposed into four steps:

1) Train an SVM on the training set;
2) Order features using the weights of the resulting classifier;
3) Eliminate features with the smallest weight;
4) Repeat the process with the training set restricted to the remaining features.
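The four steps above can be sketched as follows. To keep the example self-contained, the SVM trainer is a deliberately small batch subgradient-descent stand-in for a linear soft-margin SVM; it is an assumption made for illustration, not the solver used in this work, and the squared weight is assumed as the ranking score. The toy data are hypothetical.

```python
def train_linear_svm(X, y, C=1.0, epochs=200, lr=0.01):
    """Tiny linear soft-margin SVM trained by batch subgradient descent
    (illustrative stand-in for a real SVM solver)."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)               # gradient of 0.5 * ||w||^2
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:             # hinge loss is active for this sample
                for j in range(n_features):
                    grad_w[j] -= C * yi * xi[j]
                grad_b -= C * yi
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def rfe_ranking(X, y, n_keep=1):
    """Return (eliminated, remaining): features in elimination order
    (least useful first) and the surviving feature indices."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while len(remaining) > n_keep:                        # step 4: repeat
        X_sub = [[row[j] for j in remaining] for row in X]
        w, _ = train_linear_svm(X_sub, y)                 # step 1: train
        worst = min(range(len(remaining)),                # step 2: rank by w_j^2
                    key=lambda j: w[j] ** 2)
        eliminated.append(remaining.pop(worst))           # step 3: eliminate
    return eliminated, remaining

# Hypothetical data: feature 0 is strongly informative, feature 1 is constant,
# feature 2 is weakly informative.
X = [[2.0, 0.0, 0.1], [3.0, 0.0, 0.2], [-2.0, 0.0, -0.1], [-3.0, 0.0, -0.2]]
y = [1, 1, -1, -1]
eliminated, remaining = rfe_ranking(X, y, n_keep=1)
```

Eliminating one feature per iteration gives the finest ranking; removing several low-scoring features at a time trades ranking resolution for speed.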

IV. PROPOSED APPROACH
As already mentioned, the classification approach is based on three steps: using new features alongside timbral ones, selecting features by combining filter and wrapper methods, and optimizing the SVM classifier. The related block diagram is shown in Figure 1.

A. Features extraction
Our approach relies on features of the musical signal, extracted according to the block diagram of Figure 1. The obtained features are divided into the following two families.
- A family of conventional features representing the timbre of the musical signal.
- An additional family of features that reflects the transition segments between successive notes in the musical signal.

1) Features representing the musical signal timbre
The features in this family are designed to represent the most important perceptual properties of a sound. They constitute a set of scalar parameters related to the spectral description of the musical signal. They have been the subject of an extensive literature; further studies of their extraction are presented in [23]. The choice of these so-called low-level features generally depends on the desired application and on the extraction duration. According to their mode of calculation, as detailed in [24], they are organized as follows:
- Temporal features: convey information about the time evolution of the signal.
- Energetic features: refer to various energy contents of the signal.
- Spectral features: computed from the time-frequency representation of the signal, without a prior waveform model.
- Cepstral features: represent the shape of the spectrum with a few coefficients, using Mel bands instead of the Fourier spectrum.
- Harmonic features: computed from the detected pitch events associated with a fundamental frequency ($F_0$).
- Perceptual features: computed from auditory-filtered bandwidth versions of the signal, which aim at approximating the human perception of sounds.
These features are listed in Table 1.
In this work, we extract the parameters of a signal from Oriental music, which is known for its melodic richness [25], played on a lute. The signal is therefore relatively short, non-stationary and assumed to contain an almost percussive sound. Spectral, cepstral, harmonic and perceptual features are computed from a Short Time Fourier Transform, expressed by (9). For better time-frequency localization, a window $w(t)$ of variable length is more efficient [26]. Feature extraction is therefore based on splitting the audio signal $s(t)$ into successive frames, where each frame $s_k(t)$ of index k and duration $T_k$ is calculated by (10), in which $w(t)$ is the Hann window expressed by (11). Following the Constant Q Transform (CQT) approach applied in the Oriental music context [26], the coefficient is Q = 37 and the duration $T_k$ is given by (12).
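To illustrate the variable-length framing, the sketch below assumes the usual constant-Q relation in which the window spans Q periods of the analysed frequency, $T_k = Q / f_k$; this relation, the symmetric discrete Hann window and the sampling rate are assumptions made for the example, and the function names are hypothetical.

```python
import math

Q = 37  # quality factor quoted in the text

def frame_duration(f_k, q=Q):
    """Assumed constant-Q relation: the frame lasts q periods of frequency f_k (seconds)."""
    return q / f_k

def frame_length_samples(f_k, sample_rate, q=Q):
    """Frame length in samples for the analysis frequency f_k."""
    return max(1, round(frame_duration(f_k, q) * sample_rate))

def hann_window(n_samples):
    """Symmetric Hann window: w[n] = 0.5 * (1 - cos(2*pi*n / (N - 1)))."""
    if n_samples == 1:
        return [1.0]
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / (n_samples - 1)))
            for n in range(n_samples)]

def windowed_frame(signal, start, f_k, sample_rate):
    """Cut one frame starting at sample index `start` and apply the Hann window."""
    n = frame_length_samples(f_k, sample_rate)
    frame = signal[start:start + n]
    window = hann_window(len(frame))
    return [s * w for s, w in zip(frame, window)]

# Example: a 440 Hz analysis frequency at an assumed 44.1 kHz sampling rate.
n_440 = frame_length_samples(440.0, 44100)
```

Under this assumption, lower analysis frequencies get longer frames and higher frequencies shorter ones, which is what yields the improved time-frequency localization compared with a constant window.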

2) Transition features between notes
Most classical timbre features are restricted to the spectral characteristics of short durations and do not model the temporal aspect of music. Thus, in order to better describe a melody in monophonic recordings, we propose in this section to use additional features that are more representative of musical timbre, taking into account the pattern of the musical note and the musical context in which the note appears.
When examining the envelopes of the lute's notes, we see that they are close to a well-known pattern modeled by the classical "ADSR" model [25] (Figure 2). When two consecutive notes are separated by a silence, the problem of intra-note correlation hardly arises, unlike the majority of cases, where the notes overlap in time.
To characterize this phenomenon, we introduce a segment called "Transition", which starts at the beginning of the first note's Release phase and finishes at the end of the Attack of the following one. We represent this articulation in Figure 4, which shows the evolution of the energy and of the fundamental frequency ($F_0$) during this new segment. We denote:
- $T_c$: instant of the energy envelope minimum.
- $T_{init}$: instant at which the first note's Release phase begins.
- $E_{LT}$: energy of the first note at the instant $T_{init}$.
- $T_{end}$: instant at which the Attack of the following note ends.
- $E_{RT}$: energy of the following note at the instant $T_{end}$.
Features associated with the transition between these types of notes are:

- The energy change during the transition interval $T_D$ (from $T_{init}$ to $T_{end}$), given by (14).

We also use two features introduced and investigated in [27], namely:
- The ratio between the instant $T_c$ and the transition duration $T_D$ (15). This parameter has proven useful when reconstructing the amplitude envelope during transitions.
- A feature representing the legato, which indicates how music notes are linked together. This descriptor, whose relevance was assessed in [28], is described in (16):

$$LEG = \frac{A_1}{A_1 + A_2} = \frac{\int_{T_{init}}^{T_{end}} \left( L_t(t) - E_{XX}(t) \right) dt}{\int_{T_{init}}^{T_{end}} L_t(t) \, dt} \quad (16)$$

To compute LEG, we use the schematic view of Figure 4. First, we join the start and end points of the energy envelope contour with a line $L_t$, which represents the smoothest case of detachment. Then we calculate both the area $A_2$ below the energy envelope and the area $A_1$ between the energy envelope and the joining line. The legato feature is finally given by (16).
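The area-based construction described above can be sketched numerically. The example assumes that the legato descriptor is the ratio $A_1 / (A_1 + A_2)$, that the energy envelope is uniformly sampled between $T_{init}$ and $T_{end}$, and that trapezoidal integration is accurate enough; the function names and the sample envelopes are hypothetical.

```python
def trapezoid(values, dt):
    """Trapezoidal integral of uniformly sampled values."""
    return dt * (sum(values) - 0.5 * (values[0] + values[-1]))

def legato(energy, dt=1.0):
    """LEG = A1 / (A1 + A2) over the transition segment.

    energy: energy-envelope samples from T_init to T_end (inclusive),
    with at least two samples.
    """
    n = len(energy)
    # Straight line L_t joining the first and last points of the envelope.
    line = [energy[0] + (energy[-1] - energy[0]) * i / (n - 1) for i in range(n)]
    area_under_line = trapezoid(line, dt)                       # A1 + A2
    a1 = trapezoid([l - e for l, e in zip(line, energy)], dt)   # line minus envelope
    return a1 / area_under_line

# Hypothetical envelopes: a deep energy dip and no dip at all.
leg_dip = legato([1.0, 0.0, 1.0])
leg_flat = legato([1.0, 1.0, 1.0])
```

Because both areas are integrated with the same step, the sampling step `dt` cancels out of the ratio.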

B. Features selection
Since not all calculated features are relevant for the classification task, their dimensionality must first be reduced in order to keep only relevant candidates and thereby facilitate classification. With a filter, the solution is simple and fast, but features are selected only on their intrinsic characteristics, regardless of the classifier used. With a wrapper, the system is accurate, but there is no guarantee of rapid learning or of avoiding feature redundancy. The adopted approach therefore combines simplicity and accuracy by associating the filter and wrapper methods, as shown in Figure 5.
In this topology, the original features are first filtered in order to eliminate the redundant ones and the outliers. Because this operation selects variables as a pre-processing step, the resulting features are less correlated and a significant reduction has already been performed. This also allows a better understanding of the information contained in the subset of selected features. The computing time is therefore short for the wrapper, which selects only the features that improve the prediction accuracy and optimize the classification performance.

Fig. 5. Schematic view of features selection approach
To perform the filter, we choose a structure that simultaneously meets two fundamental criteria:
- Criterion A: choosing a subset of informative features within each class.
- Criterion B: selecting non-redundant and uncorrelated features.
For this, we adopt an algorithm based on a linear discriminant analysis (LDA) strategy, called Inertia Ratio Maximization with Feature Space Projection (IRMFSP) [6][13]. This simple and efficient filter, whose relevance was assessed in [13], selects features by iteratively satisfying criterion A (Inertia Ratio Maximization) and criterion B (Feature Space Projection). The implementation of IRMFSP is composed of two steps. The first step selects, at iteration l, the not-yet-selected feature which maximizes the ratio between the inter-class inertia and the total inertia, expressed by (17). In (17), n is the total number of feature vectors, $n_k$ is the number of feature vectors belonging to class k, and $\mu_{d,k}$ and $\mu_d$ respectively denote the average value of feature d within class k and over the total dataset; (17) also involves the value of the feature of index d assigned to vector i. The second step orthogonalizes the remaining features for the next iteration, as expressed by (18), which uses the vector of the previously selected feature and its normalized form.
Thanks to this filtering operation, the RFE-SVM wrapper is able to measure rapidly and accurately the impact of each selected feature on the classification procedure.
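The two IRMFSP steps can be sketched as follows, using the formulation commonly given for this filter: the inertia ratio of feature d is taken as $\sum_k \frac{n_k}{n} (\mu_{d,k} - \mu_d)^2$ divided by $\frac{1}{n} \sum_i (f_d(i) - \mu_d)^2$, and the projection step removes from every remaining feature its component along the normalized selected feature. The symbol $f_d(i)$, the function names, the small variance guard and the toy data are assumptions made for this illustration.

```python
def inertia_ratio(column, labels):
    """Between-class inertia divided by total inertia for one feature."""
    n = len(column)
    mu = sum(column) / n
    total = sum((v - mu) ** 2 for v in column) / n
    if total < 1e-12:          # guard (assumption): constant or vanished feature
        return 0.0
    between = 0.0
    for k in set(labels):
        members = [v for v, lab in zip(column, labels) if lab == k]
        mu_k = sum(members) / len(members)
        between += len(members) / n * (mu_k - mu) ** 2
    return between / total

def irmfsp(X, labels, n_select):
    """Return the indices of the selected features, in selection order."""
    columns = {d: [row[d] for row in X] for d in range(len(X[0]))}
    selected = []
    for _ in range(n_select):
        # Step 1: pick the remaining feature with the largest inertia ratio.
        best = max(columns, key=lambda d: inertia_ratio(columns[d], labels))
        selected.append(best)
        f = columns.pop(best)
        norm = sum(v * v for v in f) ** 0.5
        if norm == 0.0:
            break
        g = [v / norm for v in f]
        # Step 2: orthogonalize the remaining features against the selected one.
        for d in list(columns):
            proj = sum(a * b for a, b in zip(columns[d], g))
            columns[d] = [a - proj * b for a, b in zip(columns[d], g)]
    return selected

# Hypothetical data: feature 1 duplicates feature 0 (scaled by 2),
# feature 2 carries some independent class information.
X = [[-2.0, -4.0, -3.0], [-1.0, -2.0, 0.0], [-1.0, -2.0, 0.0],
     [1.0, 2.0, 1.0], [1.0, 2.0, 0.0], [2.0, 4.0, 2.0]]
labels = [0, 0, 0, 1, 1, 1]
chosen = irmfsp(X, labels, n_select=2)
```

In this toy case the redundant feature 1 is never selected: once feature 0 has been chosen, the projection step leaves almost nothing of feature 1, so the filter moves on to feature 2.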

A. Materials
In this section, we consider real tracks of Oriental music. The artists and associated musical signals are listed in Table 2. As justified in [29], the most appropriate criterion for evaluating the efficiency and accuracy of such a classification process is the F-Measure indicator (FMSR), which is the harmonic mean of the recall (RCL) and the precision (PRC). FMSR is given by (19):

$$FMSR = \frac{2 \cdot PRC \cdot RCL}{PRC + RCL} \quad (19)$$
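As a small worked example of this indicator, the sketch below computes precision, recall and their harmonic mean from counts of true positives, false positives and false negatives. The function name and the counts are hypothetical.

```python
def f_measure(tp, fp, fn):
    """Return (precision, recall, F-measure) from binary classification counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2.0 * precision * recall / (precision + recall)

# Hypothetical counts: 6 tracks correctly attributed to the artist,
# 2 wrongly attributed to the artist, 6 of the artist's tracks missed.
prc, rcl, fmsr = f_measure(tp=6, fp=2, fn=6)
```

Here the precision is 0.75 and the recall 0.5, giving an F-measure of 0.6; the harmonic mean sits closer to the weaker of the two scores than the arithmetic mean would.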

B. Classifier Optimisation
This section aims to set the classifier parameters based on binary classification. Beyond the fundamental principle of seeking parsimony, the SVM approach leaves, in practice, a number of options and settings to the user, such as the choice of the regularization parameter and of the kernel type.

1) The kernel function
The adopted kernel is the Exponential Radial Basis Function (ERBF), as it is best suited to following non-linear decision surfaces.

For greater robustness, instead of (7), we use a form which takes into account the number of learning elements [18]. The kernel function expression is given by (20).
- $m$ is the dimension of the observation vectors.
- $\sigma$ represents the width of the Gaussian function. It is the main parameter affecting the complexity of the decision surfaces. Interesting choices lie in the interval [0, 1]. The default value (before parameter refinement) is σ = 1.
Optimizing the classifier involves determining this parameter so as to maximize the FMSR performance on the training dataset. The result is shown in Figure 6. The overall performance peaks at σ = 0.5 (F = 91.94%). Nevertheless, the performance variations are very small (about 0.24%), which reflects the robustness of the SVM classifier.
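The tuning procedure can be sketched as a one-dimensional sweep over candidate widths. The kernel below is the common textbook form of the ERBF, $\exp(-\|x - z\| / (2\sigma^2))$, which is an assumption and is not the normalized expression (20) used in this work; `select_sigma`, the candidate grid and the scoring function are likewise hypothetical placeholders for a real F-measure evaluation.

```python
import math

def erbf_kernel(x, z, sigma=1.0):
    """Exponential RBF, textbook form (assumption): exp(-||x - z|| / (2 * sigma^2))."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z)))
    return math.exp(-dist / (2.0 * sigma ** 2))

def select_sigma(candidates, score):
    """Return the candidate width with the highest score.

    score: callable mapping sigma to a performance value, e.g. the F-measure
    of an SVM trained with that sigma.
    """
    return max(candidates, key=score)

# Placeholder scoring function with a single peak; a real run would train and
# evaluate the classifier for each candidate instead.
def toy_score(sigma):
    return -(sigma - 0.5) ** 2

grid = [0.1, 0.25, 0.5, 0.75, 1.0]
best_sigma = select_sigma(grid, toy_score)
```

Unlike the Gaussian RBF, this kernel uses the distance itself rather than its square, so it decays more slowly for distant points and more sharply near zero.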

2) Controlling parameter
As mentioned in Section III, C is the factor that controls the trade-off between maximizing the class separation margin and minimizing the classification errors on the training set. It is a balancing parameter to be set a priori in order to soften the SVM margin. There is no exact rule for calculating the value of C. However, the best practical results are usually obtained using an adaptive value $C_{dat}$ of this penalty parameter, based on the number m of learning elements [18]. Thus, $C_{dat}$ is obtained according to (21), wherein the kernel function is defined by (20).

3) Performance of feature selection approaches
After the SVM optimization, we evaluate the efficiency of the feature selection algorithms by comparing them according to the desired number of learning elements.
The first comparison concerns complexity. In this context, we fix the number of selected features at 50 and measure the processing time of each feature selection algorithm. The results, presented in Table 3, clearly show that the IRMFSP filter is faster than the RFE-SVM wrapper. When the two algorithms are used together, IRMFSP fulfils its role as a pre-processing step that reduces the learning time. The second comparison uses the F-measure criterion. Figure 7 compares the classification performance obtained as a function of the number of features used. The performance of the IRMFSP filter is almost constant. The efficiency and robustness of the RFE-SVM wrapper, alone or in combination (RFE-SVM + IRMFSP), are noticeable, especially when the number of features is relatively high (>50).
Overall, both algorithms are less sensitive to a reduction in the number of selected features, and they produce a more reliable ranking of the most useful features by placing the most effective ones first.
According to the results from Tables 6 and 7, a direct comparison between the two selection approaches (RFE-SVM vs IRMFSP) shows that the results are better when the two algorithms are combined, notwithstanding the simplicity and efficiency of the filter alone.

4) Multiclass classification
To evaluate the classification process in the multiclass context, we aim to highlight the effectiveness of the features used and of their handling. The confusion matrix obtained on the dataset using 50 audio descriptors is presented in Table 4. It shows an average classification accuracy of 76.88%, where each artist is well classified, with a minimum accuracy of 68.7% for Artist No. 2. These results are good and somewhat better than those described in the literature [6][13], which use only timbral features without optimizing the SVM classifier.
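For readers who wish to reproduce this kind of summary, the sketch below derives per-class and average accuracies from a confusion matrix. It assumes that rows correspond to true classes and columns to predicted classes, and that the average accuracy is the mean of the per-class accuracies; the matrix values are purely illustrative and are not those of Table 4.

```python
def per_class_accuracy(confusion):
    """Fraction of each true class (row) that was predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(confusion)]

def average_accuracy(confusion):
    """Mean of the per-class accuracies."""
    accuracies = per_class_accuracy(confusion)
    return sum(accuracies) / len(accuracies)

# Illustrative 3-class confusion matrix (rows: true artist, columns: predicted).
toy_confusion = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 3, 7],
]
class_acc = per_class_accuracy(toy_confusion)
mean_acc = average_accuracy(toy_confusion)
worst_acc = min(class_acc)
```

Reporting the minimum per-class accuracy alongside the average, as done above for Artist No. 2, shows whether a good average hides a poorly recognized class.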

VI. CONCLUSION AND FUTURE WORKS
In this paper, we have proposed an improved audio recognition method that refines the extraction of well-known timbre features and takes into account features reflecting the transition between musical notes. We have also developed a practical method for optimizing the SVM classifier, as well as a feature selection method that benefits from the advantages of both filters and wrappers. In this approach, the filter performs a simple and straightforward feature selection in order to obtain a subset of relevant features that is easier to interpret and to process with the wrapper. Although the results obtained are interesting and encouraging, some aspects may be developed in future work. In particular, we intend to investigate other kinds of features and to explore other feature selection algorithms for this classification process.

Fig. 2. ADSR model of the note's envelope

The melody of a musical signal can only be achieved if the notes follow each other according to the sequence required by the musical score. In such a succession, the energy $E_{XX}$ of the resulting signal can be represented according to the ADSR model of Figure 3 as follows.

Fig. 3. Schematic view of the intra-note segmentation energy

Fig. 4. Schematic view of the characterization of the "Transition" segment

Fig. 7. Performance of the classification based on the features number

TABLE I. LIST OF FEATURES RELATED TO THE MUSICAL TIMBRE

TABLE II. DATASET OF ARTISTS AND ASSOCIATED MUSICAL SIGNALS

TABLE III. COMPLEXITY OF FEATURES SELECTION ALGORITHMS