Speaker Identification using Frequency Distribution in the Transform Domain

In this paper, we propose Speaker Identification using the frequency distribution in the transform domain for seven transforms: DFT (Discrete Fourier Transform), DCT (Discrete Cosine Transform), DST (Discrete Sine Transform), Hartley, Walsh, Haar and Kekre transforms. The speech signal spoken by a particular speaker is converted into the frequency domain by applying the different transform techniques. The distribution in the transform domain is used to extract the feature vectors in the training and the matching phases. The results obtained with all seven transform techniques have been analyzed and compared. DFT, DCT, DST and Hartley transform give comparatively similar results (above 96%). The results obtained with the Haar and Kekre transforms are very poor. The best results are obtained with DFT (97.19% for a feature vector of size 40).


I. INTRODUCTION
Recently, a lot of work has been carried out in the field of biometrics. There are several categories of biometrics, such as fingerprint, iris, face, palm, signature and voice. Voice as a biometric has certain advantages over other biometrics: it is easy to implement, no special hardware is required, user acceptability is higher, and remote login is possible [1]. In spite of these advantages, it has not been implemented to a very large extent because of problems such as security and changes in the human voice. Human beings are able to recognize a person by hearing his voice. This process is called Speaker Identification. Speaker Identification falls under the broad category of Speaker Recognition [2-4], which covers Identification as well as Verification.
Speaker Identification (also known as closed set identification) is a 1:N matching process where the identity of a person must be determined from a set of known speakers [4-6]. Speaker Verification (also known as open set identification) serves to establish whether the speaker is who he claims to be [7]. Speaker Identification can be further classified into text-dependent and text-independent systems. In a text-dependent system, the system knows what utterances to expect from the speaker. In a text-independent system, however, no assumptions about the text can be made, and the system must be more flexible than a text-dependent system. Speaker Recognition systems have been developed for a wide range of applications, such as controlling access to restricted services (for example, giving commands to a computer, phone access to banking, database services, shopping or voice mail) and access to secure equipment [8-11].
In our earlier work, we proposed Speaker Identification using the row mean of the DFT, DCT, DST and Walsh transforms of the speech signal [24-25], speaker recognition using the row mean of transform techniques applied to the spectrogram of the speech signal [26], and speaker identification using power distribution in the frequency domain [27-28].
In this paper, we extend the technique of power distribution in the frequency domain to four more transforms, i.e., the Hartley, Walsh, Haar and Kekre transforms. We use the power distribution in the frequency domain to extract the features for the reference as well as the test speech samples. Feature matching is done using Euclidean distance. The various transform techniques are explained in Section II. In Section III, the feature vector extraction is explained. Results are discussed in Section IV, and the conclusion is given in Section V.

II. TRANSFORM TECHNIQUES
A transform applied to a speech signal converts it from the time domain to the frequency domain. In this paper, seven different transform techniques have been used. Let $y(t)$ be the speech signal in the time domain and $y_0, y_1, y_2, \ldots, y_{N-1}$ be the samples of $y(t)$. The Discrete Fourier Transform of this signal is given by (1). The DFT is implemented using the Fast Fourier Transform (FFT).

$$Y_k = \sum_{n=0}^{N-1} y_n \, e^{-j 2\pi k n / N} \qquad (1)$$

where $y_n = y(n\Delta t)$ is the sampled value of the continuous signal $y(t)$, $k = 0, 1, 2, \ldots, N-1$, and $\Delta t$ is the sampling interval.
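As an illustration of the DFT definition, the following minimal sketch evaluates the sum directly in O(N^2) time. It assumes the standard unnormalized form with a negative exponent; the function name `dft` is hypothetical, and a practical system would use an FFT routine instead.

```python
import cmath


def dft(samples):
    """Direct O(N^2) evaluation of Y_k = sum_n y_n * exp(-2j*pi*k*n/N).

    Assumption: unnormalized forward DFT with a negative exponent.
    """
    n_samples = len(samples)
    return [
        sum(y * cmath.exp(-2j * cmath.pi * k * n / n_samples)
            for n, y in enumerate(samples))
        for k in range(n_samples)
    ]


# A constant signal concentrates all of its energy in the k = 0 coefficient.
spectrum = dft([1.0, 1.0, 1.0, 1.0])
magnitudes = [abs(c) for c in spectrum]
```

The magnitudes of the coefficients, rather than the complex values, are what the feature extraction described later operates on.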
The discrete cosine transform (DCT), which is closely related to the DFT, has been used in compression because of its capability of reconstruction with a few coefficients. The DCT of the signal $y(t)$ is given by (2), with $w_k$ as given by (3).

$$Y_k = w_k \sum_{n=0}^{N-1} y_n \cos\!\left(\frac{\pi (2n+1) k}{2N}\right), \quad k = 0, 1, \ldots, N-1 \qquad (2)$$

$$w_k = \begin{cases} \sqrt{1/N}, & k = 0 \\ \sqrt{2/N}, & 1 \le k \le N-1 \end{cases} \qquad (3)$$

A discrete sine transform (DST) expresses a sequence of finitely many data points in terms of a sum of sine functions. The DST of the signal $y(t)$ is given by (4).

$$Y_k = \sum_{n=0}^{N-1} y_n \sin\!\left(\frac{\pi (n+1)(k+1)}{N+1}\right), \quad k = 0, 1, \ldots, N-1 \qquad (4)$$

The Walsh transform, or Walsh-Hadamard transform, is a non-sinusoidal, orthogonal transformation technique that decomposes a signal into a set of basis functions. These basis functions are Walsh functions, which are rectangular or square waves with values of +1 or -1.
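Returning to the DCT and DST described above, the sketch below evaluates both sums directly. It assumes the orthonormal DCT-II weighting (w_k equal to sqrt(1/N) for k = 0 and sqrt(2/N) otherwise) and an unnormalized DST-I with argument pi(n+1)(k+1)/(N+1); these conventions and the function names `dct` and `dst` are assumptions chosen for illustration, not necessarily the exact forms used in the experiments.

```python
import math


def dct(samples):
    """Orthonormal DCT-II: Y_k = w_k * sum_n y_n cos(pi (2n+1) k / (2N))."""
    n = len(samples)
    out = []
    for k in range(n):
        # Assumed weights: sqrt(1/N) for k = 0, sqrt(2/N) otherwise.
        w = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(y * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                    for i, y in enumerate(samples))
        out.append(w * total)
    return out


def dst(samples):
    """Unnormalized DST-I: Y_k = sum_n y_n sin(pi (n+1)(k+1) / (N+1))."""
    n = len(samples)
    return [
        sum(y * math.sin(math.pi * (i + 1) * (k + 1) / (n + 1))
            for i, y in enumerate(samples))
        for k in range(n)
    ]


# With orthonormal weights the DCT preserves signal energy.
signal = [1.0, 2.0, 3.0, 4.0]
energy_in = sum(v * v for v in signal)
energy_out = sum(c * c for c in dct(signal))
```

With this DST convention, applying the transform twice returns the input scaled by (N+1)/2, which is a convenient sanity check.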
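The Walsh-Hadamard transform is used in a number of applications, such as image processing, speech processing, filtering, and power spectrum analysis. Like the FFT, the Walsh-Hadamard transform has a fast version, the fast Walsh-Hadamard transform (FWHT). Compared to the FFT, the FWHT requires less storage space and is faster to calculate because it uses only real additions and subtractions, while the FFT requires complex values. The FWHT is able to represent signals with sharp discontinuities more accurately using fewer coefficients than the FFT. The FWHT is a divide-and-conquer algorithm that recursively breaks down a WHT of size N into two smaller WHTs of size N/2. This implementation follows the recursive definition of the 2N×2N Hadamard matrix $H_{2N}$ as given by (5).

$$H_{2N} = \begin{bmatrix} H_N & H_N \\ H_N & -H_N \end{bmatrix} \qquad (5)$$

A discrete Hartley transform (DHT) is a real transform similar to the discrete Fourier transform (DFT). If the speech signal is represented by $y(t)$, then the DHT is given by (6).

$$Y_k = \sum_{n=0}^{N-1} y_n \left[\cos\!\left(\frac{2\pi n k}{N}\right) + \sin\!\left(\frac{2\pi n k}{N}\right)\right], \quad k = 0, 1, \ldots, N-1 \qquad (6)$$

The Haar transform is derived from the Haar matrix. The Haar transform is separable and can be expressed in matrix form as shown in (7).

$$[F] = [H][f] \qquad (7)$$

The Haar basis functions are $h_k(t)$, $k = 0, 1, \ldots, N-1$, where each $k > 0$ is written as $k = 2^p + q - 1$. When $k = 0$, the Haar function is defined as a constant, as in (8).

$$h_0(t) = \frac{1}{\sqrt{N}} \qquad (8)$$

When $k > 0$, the Haar function is defined as in (9).

$$h_k(t) = \frac{1}{\sqrt{N}} \begin{cases} 2^{p/2}, & (q-1)/2^p \le t < (q-0.5)/2^p \\ -2^{p/2}, & (q-0.5)/2^p \le t < q/2^p \\ 0, & \text{otherwise} \end{cases} \qquad (9)$$

where $0 \le p < \log_2 N$ and $1 \le q \le 2^p$. For example, when $N = 4$, we have $H_4$ as given by (10).

$$H_4 = \frac{1}{\sqrt{4}} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ \sqrt{2} & -\sqrt{2} & 0 & 0 \\ 0 & 0 & \sqrt{2} & -\sqrt{2} \end{bmatrix} \qquad (10)$$

The Kekre transform matrix can be of any size N×N, which need not be a power of 2 (as is the case with most other transforms, including the Haar transform). All upper diagonal and diagonal values of the Kekre transform matrix are one, while the lower diagonal part, except the values just below the diagonal, is zero. The generalized N×N Kekre transform matrix is given in (11).

$$K_{N \times N} = \begin{bmatrix} 1 & 1 & 1 & \cdots & 1 & 1 \\ -N+1 & 1 & 1 & \cdots & 1 & 1 \\ 0 & -N+2 & 1 & \cdots & 1 & 1 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 1 & 1 \\ 0 & 0 & 0 & \cdots & -N+(N-1) & 1 \end{bmatrix} \qquad (11)$$

The formula for generating the term $K_{xy}$ of the Kekre transform matrix is given by (12).

$$K_{xy} = \begin{cases} 1, & x \le y \\ -N + (x-1), & x = y+1 \\ 0, & x > y+1 \end{cases} \qquad (12)$$

3. This was then divided into various groups, and the sum of the magnitudes in each group forms the feature vector.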
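Two of the transforms in Section II lend themselves to short sketches: the recursive fast Walsh-Hadamard transform and the Kekre transform. The code below is illustrative only. It assumes an unnormalized Hadamard recursion in natural (Hadamard) order, the usual Kekre matrix rule with 1-based indices (ones on and above the diagonal, -N + (x - 1) just below it, zeros elsewhere), and hypothetical function names. The matrix-free Kekre routine uses running suffix sums so that the N×N matrix never has to be stored.

```python
def fwht(samples):
    """Recursive, unnormalized Walsh-Hadamard transform (length a power of 2).

    Uses H_2N = [[H_N, H_N], [H_N, -H_N]]: the top half transforms the sum of
    the two halves of the input, the bottom half transforms their difference.
    """
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return list(samples)
    half = n // 2
    top = fwht([samples[i] + samples[i + half] for i in range(half)])
    bottom = fwht([samples[i] - samples[i + half] for i in range(half)])
    return top + bottom


def kekre_matrix(n):
    """N x N Kekre matrix; x is the 1-based row index, y the 1-based column."""
    def entry(x, y):
        if x <= y:
            return 1
        if x == y + 1:
            return -n + (x - 1)
        return 0
    return [[entry(x, y) for y in range(1, n + 1)] for x in range(1, n + 1)]


def kekre_transform(samples):
    """Matrix-free Kekre transform using suffix sums (O(N) memory)."""
    n = len(samples)
    out = [0] * n
    suffix = 0
    for x in range(n, 0, -1):            # 1-based row index, last row first
        suffix += samples[x - 1]         # sum of f(y) for y >= x
        out[x - 1] = suffix
        if x > 1:                        # sub-diagonal term of row x
            out[x - 1] += (-n + x - 1) * samples[x - 2]
    return out
```

For example, `kekre_transform([1, 2, 3, 4])` gives the same coefficients as multiplying `kekre_matrix(4)` by the column vector, and it works for lengths that are not powers of 2.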

IV. EXPERIMENTAL RESULTS
The speech samples used in this work were recorded using Sound Forge 4.5. The sampling frequency is 8000 Hz (8-bit, mono PCM samples). Table I shows the database description. The samples were collected from speakers of different age groups, ranging from 12 to 75 years. Five iterations of four different sentences of varying lengths were recorded from each of the speakers, giving twenty samples per speaker. For text-dependent identification, four iterations of a particular sentence are kept in the database and the remaining iteration is used for testing. These speech signals have an amplitude range of -1 to +1. The simulation was done using MATLAB 7.7.0. For the DFT, the FFT algorithm was used to calculate the transform coefficients. For the DCT, DST and Walsh transforms, the built-in MATLAB functions were used. To calculate the Hartley transform coefficients, the FFT of the real-valued speech signal was calculated first, and then the imaginary part of the complex transform was subtracted from its real part, as shown in (13).

$$\mathrm{DHT}(y) = \mathrm{Re}\{\mathrm{FFT}(y)\} - \mathrm{Im}\{\mathrm{FFT}(y)\} \qquad (13)$$

For the Kekre transform, the difficulty was that generating transform matrices of order 65536×65536, 32768×32768 and 16384×16384 gave an 'out of memory' error.
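The Hartley computation described above (real part of the Fourier transform minus its imaginary part) can be checked against the direct cosine-plus-sine sum. The following sketch uses a naive DFT in place of the FFT so that it stays self-contained; the function names are hypothetical, and the forward DFT is assumed to use a negative exponent.

```python
import cmath
import math


def dft(samples):
    """Naive forward DFT with a negative exponent (stand-in for an FFT)."""
    n = len(samples)
    return [
        sum(y * cmath.exp(-2j * cmath.pi * k * i / n)
            for i, y in enumerate(samples))
        for k in range(n)
    ]


def hartley_via_dft(samples):
    """Hartley coefficients as Re(DFT) - Im(DFT) for a real-valued signal."""
    return [c.real - c.imag for c in dft(samples)]


def hartley_direct(samples):
    """Hartley coefficients from the definition: sum_n y_n (cos + sin)."""
    n = len(samples)
    return [
        sum(y * (math.cos(2 * math.pi * k * i / n)
                 + math.sin(2 * math.pi * k * i / n))
            for i, y in enumerate(samples))
        for k in range(n)
    ]


signal = [0.5, -0.25, 1.0, 0.0, -1.0]
via_dft = hartley_via_dft(signal)
direct = hartley_direct(signal)
```

The two results agree because the real part of the DFT is the cosine sum and the imaginary part is the negated sine sum.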
Instead of computing the transform matrix, the coefficients were calculated directly, as given in (14), where $f(y)$, $y = 1, \ldots, N$, are the speech samples and $F(x)$ are the Kekre transform coefficients.

$$F(x) = \begin{cases} \sum_{y=1}^{N} f(y), & x = 1 \\ (-N + x - 1)\, f(x-1) + \sum_{y=x}^{N} f(y), & 2 \le x \le N \end{cases} \qquad (14)$$

Calculating the Haar transform coefficients required a transform matrix of the same order. Here too, the problem was solved by calculating the coefficients directly, using the butterfly diagram approach.

After transforming the signal into the transform domain, the magnitude plot was generated as shown in Figure 1. As can be seen from the magnitude plots, the energy is concentrated in the lower-order coefficients. This observation was used as follows: the frequency spectrum was divided into groups, and the sum of the magnitudes in each group formed one element of the feature vector. In the training phase, the feature vectors of all the reference speech samples were calculated for the different transforms and stored in the database. In the matching phase, the test sample to be identified is processed in the same way as in the training phase to form its feature vector. The speaker whose stored feature vector gives the minimum Euclidean distance to the feature vector of the input sample is declared the identified speaker. The accuracy of the identification system is calculated as given by (15).

$$\text{Accuracy (\%)} = \frac{\text{number of correctly identified test samples}}{\text{total number of test samples}} \times 100 \qquad (15)$$
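The training and matching steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes equal-size contiguous groups (with any leftover coefficients added to the last group), a database stored as a dictionary from speaker label to reference feature vectors, and hypothetical function names.

```python
import math


def feature_vector(magnitudes, n_groups):
    """Sum the magnitude spectrum over n_groups contiguous groups.

    Assumption: groups are equal in size; leftover coefficients at the end
    of the spectrum are added to the last group.
    """
    if not 1 <= n_groups <= len(magnitudes):
        raise ValueError("n_groups must be between 1 and len(magnitudes)")
    size = len(magnitudes) // n_groups
    features = []
    for g in range(n_groups):
        start = g * size
        end = len(magnitudes) if g == n_groups - 1 else start + size
        features.append(sum(magnitudes[start:end]))
    return features


def identify(test_features, database):
    """Return the speaker whose reference vector is nearest (Euclidean)."""
    best_distance, best_speaker = min(
        (math.dist(test_features, ref), speaker)
        for speaker, refs in database.items()
        for ref in refs
    )
    return best_speaker


def accuracy(predicted, actual):
    """Percentage of test samples whose speaker was identified correctly."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


# Toy example with made-up numbers (not data from the experiments).
features = feature_vector([1, 2, 3, 4, 5, 6, 7, 8], 4)
database = {"A": [[3, 7]], "B": [[10, 1]]}
speaker = identify([4, 6], database)
```

Varying `n_groups` in such a sketch corresponds to varying the feature vector size in the experiments reported below.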
The sentences in the database are of varying lengths. We performed the simulations for three different lengths of the sentences. In the first case, we considered only the first 2.048 sec (16384 samples) of the sentence for each speaker in the training as well as in the testing phase. Figure 2 shows the accuracy obtained for the different transforms for a speech signal of length 2.048 sec (16384 samples). We began by taking the entire spectrum as one group and taking the sum of the magnitudes as the feature vector. In this case, there is only one element in the feature vector. As can be seen, the accuracy is very low for all the transforms; for FFT, we get an accuracy of around 6.54%. As we divide the spectrum into a larger number of groups and take the sum of each group as an element of the feature vector, the accuracy increases. For FFT, the accuracy is 93.45% for a feature vector of size 56. Above a feature vector of size 56, the accuracy decreases, and we obtain an accuracy of 92.52% for a feature vector of size 88. DCT and DST also show a similar trend, with a maximum accuracy of 89.71% for a feature vector of size 40.
With the Walsh transform, though the trend is similar, the maximum accuracy is only 79.43% for a feature vector of size 80. The Hartley transform shows a behavior similar to FFT, and its maximum accuracy is 93.45% for a feature vector of size 56. As can also be seen from the magnitude spectrum, the energy compaction of the Kekre and Haar transforms is lower than that of the other transforms. This explains the lower performance of both transforms: 41.12% for the Kekre transform and 60.74% for the Haar transform.

For the second set of simulations, the first 4.096 sec of the sentence spoken by each speaker was considered in the training as well as in the testing phase. Figure 3 shows the results obtained for this set of experiments. As can be seen from Figure 3, the overall trend shown by each transform is the same as in Figure 2, but the increase in the length of the speech signal increases the accuracy. With FFT, the maximum accuracy is 97.19% for a feature vector of size 48. For DCT and DST, the maximum accuracy is 95.32% for a feature vector of size 48. With the Walsh transform, the maximum accuracy is now around 85%. The Hartley transform gives a maximum accuracy of 96.26% for a feature vector of size 48. There is no significant improvement for the Kekre and Haar transforms. Overall, there is a gain in accuracy from increasing the length of the speech signal under consideration.

Figure 4 shows the results obtained by increasing the length of the speech signal to 8.192 sec (65536 samples). If a speech signal is shorter than 8.192 sec, it is padded with zeros so that all signals are of equal length. As can be seen from the results, there is not much gain over that obtained by considering 4.096 sec. The maximum accuracy is still 97.19% for FFT, now with a feature vector of size 40. The trend shown by all the transforms remains the same.
The overall results indicate that the accuracy increases with the size of the feature vector up to a certain point and then decreases. The FFT, DCT, DST and Hartley transforms give very good results. The Walsh transform gives comparatively lower results. The Haar and Kekre transforms give lower accuracy than all the other transforms. This technique of using the magnitude spectrum is very simple to implement and gives results comparable with the traditional techniques used for speaker identification. For the present study, we have not used any preprocessing techniques on the speech signal. The database was collected using different brands of locally available microphones under normal conditions. This suggests that the results obtained do not depend on the specifications of the recording instrument.

V. CONCLUSION AND FUTURE SCOPE
In this paper, we have presented a comparative performance study of speaker identification using seven different transform techniques. The approach used in this work is entirely different from the studies which have been done in this area. Here we simply use the distribution in the magnitude spectrum for feature vector extraction. For feature matching, we use the minimum Euclidean distance as the measure. This makes the system very easy to implement. The maximum accuracy is 97.19% with FFT for a feature vector of size 48. The present study is ongoing, and we are trying to analyze the transform domain further, as it has proved to be a promising way of extracting feature vectors. Different algorithms for extracting the feature vector using transforms are being developed.

In (7), [f] is an N×1 signal, [H] is an N×N Haar transform matrix and [F] is an N×1 transformed signal. The transformation H contains sampled versions of the Haar basis functions $h_k(t)$, which are defined over the continuous closed interval $t \in [0, 1]$.
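To make the sampling of the Haar basis functions concrete, the sketch below builds the N×N Haar matrix row by row and applies it to a signal. It assumes the common convention k = 2^p + q - 1 with amplitude 2^(p/2)/sqrt(N) on the first half of each function's support and the negated value on the second half; the function names are hypothetical. Building the full matrix is only practical for small N, which is why the experiments compute the coefficients directly for long signals.

```python
import math


def haar_matrix(n):
    """N x N Haar matrix from h_k(t) sampled at t = m/N, m = 0..N-1."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    scale = 1.0 / math.sqrt(n)
    rows = [[scale] * n]                     # k = 0: constant function
    for k in range(1, n):
        p = k.bit_length() - 1               # largest p with 2**p <= k
        q = k - (1 << p) + 1                 # so that k = 2**p + q - 1
        width = n >> p                       # samples in the support of h_k
        start = (q - 1) * width
        amplitude = scale * 2.0 ** (p / 2)
        row = [0.0] * n
        for m in range(start, start + width // 2):
            row[m] = amplitude               # first half of the support
        for m in range(start + width // 2, start + width):
            row[m] = -amplitude              # second half of the support
        rows.append(row)
    return rows


def haar_transform(signal):
    """Compute [F] = [H][f] by explicit matrix-vector multiplication."""
    matrix = haar_matrix(len(signal))
    return [sum(h * f for h, f in zip(row, signal)) for row in matrix]


h4 = haar_matrix(4)
```

The rows of the resulting matrix are orthonormal, so the transform preserves the energy of the signal.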

TABLE I. DATABASE DESCRIPTION