Speaker Identification using Row Mean of Haar and Kekre’s Transform on Spectrograms of Different Frame Sizes

In this paper, we propose Speaker Identification using two transforms, namely Haar Transform and Kekre’s Transform. The speech signal spoken by a particular speaker is converted into a spectrogram by using 25% and 50% overlap between consecutive sample vectors. The two transforms are applied on the spectrogram. The row mean of the transformed matrix forms the feature vector, which is used in the training as well as matching phases. The results of both the transform techniques have been compared. Haar transform gives fairly good results with a maximum accuracy of 69% for both 25% as well as 50% overlap. Kekre’s Transform shows much better performance, with a maximum accuracy of 85.7% for 25% overlap and 88.5% accuracy for 50% overlap.


I. INTRODUCTION
Humans recognize the voice of someone familiar and are able to match the speaker's name to his or her voice. This process is called Speaker Identification. Speaker Identification falls under the broad category of Speaker Recognition [1] - [3], which covers Identification as well as Verification. Speaker identification determines which registered speaker provides a given utterance from amongst a set of known speakers (also known as closed set identification). Speaker verification accepts or rejects the identity claim of a speaker (also known as open set identification). The Speaker Identification task can be further classified as text-dependent or text-independent [4] - [6]. In the former case, the utterance presented to the system is known beforehand. In the latter case, no assumption about the text being spoken is made, but the system must model the general underlying properties of the speaker's vocal spectrum. In general, text-dependent systems are more reliable and accurate, since both the content and the voice can be compared [3], [4]. With a large number of applications such as voice dialing, phone banking, teleshopping, database access services, information services, voice mail, security systems and remote access to computers, automated systems need to perform as well as, or even better than, humans [7] - [10].
Work on Speaker Identification started as early as 1960. Since then, many feature extraction techniques have been implemented, such as filter banks [11], formant analysis [12], auto-correlation [13], instantaneous spectra covariance matrix [14], spectrum and fundamental frequency histograms [15], linear prediction coefficients [16] and long term averaged spectra [17]. Some recent works on speaker identification depend on classical features including cepstrum with many variants [4], sub-band processing technique [18 - 21], Gaussian mixture models (GMM) [22], linear prediction coding [23, 24], wavelet transform [25 - 27] and neural networks [26 - 28]. A lot of work has been done in this regard, but there is still a lack of understanding of the characteristics of the speech signal that can uniquely identify a speaker.
The concept of row mean of the transform techniques has been used for content based image retrieval (CBIR) [35 - 38]. This technique has also been applied to speaker identification by first converting the speech signal into a spectrogram [39]. We have proposed Speaker Identification using row mean of DFT, DCT, DST and Walsh Transforms on the speech signal [40] - [41].
In this paper we propose a different approach that uses spectrograms. The row means of the Haar and Kekre's Transforms are taken on the spectrogram of the speech signal, generated with different frame sizes. The generalized block diagram of the speaker identification system is shown in Fig. 1. As shown in Fig. 1, the reference signals in the database are first converted into their spectrograms and then the transforms are applied. The feature vectors are extracted and stored. The test signal to be identified is similarly processed and its feature vector is matched with the feature vectors stored in the database. The speaker whose stored feature vector gives the minimum Euclidean distance to the feature vector of the test signal is declared as the identified speaker. Section II describes the process of converting the speech signal into a spectrogram. The Haar and Kekre's transforms are explained in Section III. In Section IV, the feature vector extraction is explained. Results are discussed in Section V and the conclusion is given in Section VI.

II. SPECTROGRAM GENERATION
The first step in the speaker identification system is to convert the speech signal into a spectrogram. A spectrogram is a time-varying spectral representation [42] (forming an image) that shows how the spectral density of a signal varies with time. Spectrograms have been used for speaker identification for a long time [43 - 45]. Spectrograms are usually created in one of two ways: approximated as a filterbank that results from a series of bandpass filters (this was the only way before the advent of modern digital signal processing), or calculated from the time signal using the short-time Fourier transform (STFT). Creating a spectrogram using the STFT is usually a digital process. Digitally sampled data, in the time domain, is broken up into chunks, which usually overlap, and Fourier transformed to calculate the magnitude of the frequency spectrum for each chunk. Each chunk then corresponds to a vertical line in the image: a measurement of magnitude versus frequency for a specific moment in time. These spectra are then "laid side by side" to form the image or a three-dimensional surface. This is done using the following steps:
1. The speech signal is first divided into frames (of sizes 32, 64, 96, 128, 160, 192, 224, 256, 292 or 320) with an overlap of 25% or 50%.
2. These frames are arranged column-wise to form a matrix. For example, a one-dimensional speech signal of size 44096×1 is divided into frames of 256 samples each with an overlap of 25% between consecutive frames, i.e. an overlap of 64 samples. These frames are then arranged column-wise to form a matrix of dimension 256×229.
3. The Discrete Fourier Transform (DFT) is applied to this matrix column-wise.
4. The spectrogram is then plotted as the squared magnitude of this transform matrix.
As can be seen from the spectrograms in Fig. 3, there is much similarity between two iterations of the same speaker, whereas the spectrograms of different speakers vary. Looking at these features, we decided to implement the row mean technique.
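The four steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `spectrogram` and its parameters are hypothetical, no window function is applied (the text does not mention one), and trailing samples that do not fill a complete frame are assumed to be dropped.

```python
import numpy as np

def spectrogram(signal, frame_size=256, overlap=0.25):
    """Squared-magnitude spectrogram with frames stored as columns.

    Assumptions (not stated in the text): rectangular window,
    incomplete trailing frame discarded.
    """
    signal = np.asarray(signal, dtype=float).ravel()
    if len(signal) < frame_size:
        raise ValueError("signal is shorter than one frame")
    # Step 1: frames of `frame_size` samples, overlapping by `overlap`.
    hop = frame_size - int(frame_size * overlap)
    n_frames = (len(signal) - frame_size) // hop + 1
    # Step 2: arrange the frames column-wise.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_size] for i in range(n_frames)],
        axis=1,
    )
    # Step 3: DFT of each column.
    spectrum = np.fft.fft(frames, axis=0)
    # Step 4: squared magnitude.
    return np.abs(spectrum) ** 2
```

For a signal of 44096 samples, a frame size of 256 and a 25% overlap, the hop is 192 samples and the result has shape 256×229, which agrees with the example in step 2.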

A. Haar Transform
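This sequence was proposed in 1909 by Alfréd Haar [46]. Haar used these functions to give an example of a countable orthonormal system for the space of square-integrable functions on the real line [47, 48]. The Haar transform is derived from the Haar matrix. The Haar transform is separable and can be expressed in matrix form as:
$$F = H f H^{T} \qquad (1)$$
The N Haar functions can be sampled at t = (2n+1)Δ, where Δ = T/(2N) and n = 0, 1, 2, 3, …, N-1, to form an N×N matrix for the discrete Haar transform. For example, when N = 4, we have
$$H_4 = \frac{1}{\sqrt{4}} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ \sqrt{2} & -\sqrt{2} & 0 & 0 \\ 0 & 0 & \sqrt{2} & -\sqrt{2} \end{bmatrix} \qquad (4)$$
For N = 8,
$$H_8 = \frac{1}{\sqrt{8}} \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 \\ \sqrt{2} & \sqrt{2} & -\sqrt{2} & -\sqrt{2} & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & \sqrt{2} & \sqrt{2} & -\sqrt{2} & -\sqrt{2} \\ 2 & -2 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 2 & -2 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 2 & -2 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 2 & -2 \end{bmatrix} \qquad (5)$$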
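As a rough illustration of how the discrete Haar matrix follows from sampling the Haar basis functions, the sketch below builds the N×N matrix for N a power of 2. The function name `haar_matrix` is hypothetical, T = 1 is assumed (so the sample points are t = (2n+1)/(2N)), and the index mapping k = 2^p + q − 1 is the standard one. How the frame sizes that are not powers of 2 are handled for the Haar transform is not stated in the text, so this sketch simply rejects them.

```python
import numpy as np

def haar_matrix(N):
    """N x N discrete Haar matrix (rows are sampled basis functions h_k)."""
    if N < 1 or N & (N - 1):
        raise ValueError("N must be a power of 2")
    # Sample points t = (2n + 1) * delta, delta = T / (2N), with T = 1 assumed.
    t = (2 * np.arange(N) + 1) / (2 * N)
    H = np.zeros((N, N))
    H[0, :] = 1.0                       # k = 0: constant function
    for k in range(1, N):
        p = k.bit_length() - 1          # largest p with 2**p <= k
        q = k - 2**p + 1                # so that k = 2**p + q - 1
        lo, mid, hi = (q - 1) / 2**p, (q - 0.5) / 2**p, q / 2**p
        H[k, (t >= lo) & (t < mid)] = 2 ** (p / 2)
        H[k, (t >= mid) & (t < hi)] = -(2 ** (p / 2))
    return H / np.sqrt(N)               # common 1/sqrt(N) normalisation
```

The rows of the resulting matrix are orthonormal, so `haar_matrix(N) @ haar_matrix(N).T` is the identity up to rounding.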

B. Kekre's Transform
Kekre's Transform matrix [49,51] can be of any size N×N, which need not be a power of 2 (as is the case with most other transforms, including the Haar Transform). All diagonal and upper-diagonal values of Kekre's transform matrix are one, while the lower-diagonal part, except the values just below the diagonal, is zero. The generalized N×N Kekre's Transform matrix can be given as in (6).
The formula for generating the term K_xy of Kekre's transform matrix is given by (7).
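A minimal sketch of generating such a matrix is given below. The ones on and above the diagonal and the zeros in the lower part follow the description above. The value placed just below the diagonal, −N + (x − 1) for 1-based row index x, is not stated in this text and is taken here as an assumption from the commonly published form of Kekre's transform. The function name `kekre_matrix` is hypothetical.

```python
import numpy as np

def kekre_matrix(N):
    """N x N Kekre's transform matrix; N may be any positive integer."""
    if N < 1:
        raise ValueError("N must be a positive integer")
    # Ones on the diagonal and above, zeros below.
    K = np.triu(np.ones((N, N)))
    # Entry just below the diagonal (assumed form): -N + (x - 1) for
    # 1-based row x, i.e. -N + i for 0-based row i.
    for i in range(1, N):
        K[i, i - 1] = -N + i
    return K
```

With this choice of sub-diagonal values the rows are mutually orthogonal (though not of unit length), so `K @ K.T` is a diagonal matrix.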

IV. FEATURE VECTOR EXTRACTION
The procedure for feature vector extraction is given below:
1. The column transform (Haar or Kekre's Transform) is applied on the spectrogram of the speech signal.
2. The mean of the absolute values of each row of the transform matrix is then calculated.
3. These row means form a column vector (M×1, where M is the number of rows in the transform matrix).
4. This column vector forms the feature vector for the speech sample.
5. The feature vectors for all the speech samples are calculated for different values of n and stored in the database.
Fig. 5 shows the feature vector generation technique.
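Steps 1 to 4 above can be sketched as follows, assuming the spectrogram is an M×(number of frames) array and the transform is an M×M matrix, so that the column transform is a left multiplication. The function name `row_mean_feature` is hypothetical and the sketch is illustrative rather than the authors' code.

```python
import numpy as np

def row_mean_feature(spectrogram, transform):
    """Row mean of the absolute column-transformed spectrogram.

    spectrogram: M x F array (M frequency rows, F frames).
    transform:   M x M transform matrix (e.g. Haar or Kekre's).
    Returns a length-M feature vector.
    """
    S = np.asarray(spectrogram, dtype=float)
    T = np.asarray(transform, dtype=float)
    if T.shape != (S.shape[0], S.shape[0]):
        raise ValueError("transform must be M x M for an M-row spectrogram")
    transformed = T @ S                       # step 1: column transform
    return np.abs(transformed).mean(axis=1)   # steps 2-4: row means
```

The feature vector length equals the frame size used for the spectrogram, which is why the results are reported against feature vector sizes that match the frame sizes.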

A. Database Description
The speech samples used in this work are recorded using Sound Forge 4.5. The sampling frequency is 8000 Hz (8 bit, mono PCM samples).
Table I shows the database description. Five iterations of four different sentences of varying lengths are recorded from each of the speakers. Twenty samples per speaker are taken. For text-dependent identification, four iterations of a particular sentence are kept in the database and the remaining iteration is used for testing.

B. Experimental Results
The feature vectors of all the reference speech samples are stored in the database in the training phase. In the matching phase, the test sample that is to be identified is processed in the same way as in the training phase to form its feature vector. The speaker whose stored feature vector gives the minimum Euclidean distance to the feature vector of the input sample is declared as the identified speaker.
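The matching phase described above amounts to a nearest-neighbour search under the Euclidean distance. A minimal sketch is shown below; the function name `identify_speaker` and the representation of the database as a list of (speaker label, feature vector) pairs are assumptions made for illustration.

```python
import numpy as np

def identify_speaker(test_feature, database):
    """Return (speaker_id, distance) of the closest stored feature vector.

    database: iterable of (speaker_id, feature_vector) pairs, where all
    feature vectors have the same length as test_feature.
    """
    test = np.asarray(test_feature, dtype=float)
    best_id, best_dist = None, np.inf
    for speaker_id, reference in database:
        dist = np.linalg.norm(test - np.asarray(reference, dtype=float))
        if dist < best_dist:
            best_id, best_dist = speaker_id, dist
    return best_id, best_dist
```

A speaker may appear several times in the database (for example once per stored iteration of a sentence); the label of the single closest entry is returned.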

C. Accuracy of Identification
The accuracy of the identification system is calculated as given by (8).
$$\text{Accuracy (\%)} = \frac{\text{number of correctly identified test samples}}{\text{total number of test samples}} \times 100 \qquad (8)$$
Fig. 6 shows the results obtained by using the two transforms for an overlap of 25% between the adjacent frames while creating the spectrograms of the speech signals. As can be seen from the graphs, the Haar transform gives an average performance of around 60%. The maximum accuracy (69%) is obtained for feature vector sizes of 32 and 64, and the minimum (51%) for a feature vector size of 160. As the feature vector size is increased further, the accuracy drops and there is no improvement. For Kekre's transform, the average accuracy is around 83.33%, with a maximum accuracy of around 85% for a feature vector size of 256 and a minimum accuracy of 82% for feature vector sizes of 64 to 160.
Fig. 7 shows the results obtained by using the two transforms for an overlap of 50% between the adjacent frames while creating the spectrograms of the speech signals. Here also the Haar transform gives an average accuracy of around 60%. The behavior of the Haar transform is very similar in both cases. For Kekre's transform, the average accuracy is slightly higher, i.e. 85.7%. The maximum accuracy is 88.5% for a feature vector size of 256. Overall, Kekre's transform gives much better results than the Haar transform.

VI. CONCLUSION
In this paper we have compared the performance of the Haar and Kekre's transforms for speaker identification for two different cases (25% and 50% overlap). The Haar transform gives an average accuracy of around 60% in both cases. Its accuracy does not increase as the feature vector size is increased from 64 onwards. Kekre's transform gives an accuracy of more than 80% in both cases. The maximum accuracy obtained for Kekre's transform is 88.5%, for a feature vector size of 256. The present study is ongoing and we are analyzing the performance of other transforms.

Fig. 2 (a) shows a sample speech signal from the database. Fig. 2 (b) shows the spectrogram plotted for the speech signal of Fig. 2 (a) with a frame size of 256 and an overlap of 25% between adjacent frames. Fig. 2 (c) shows the spectrogram plotted for the same speech signal with an overlap of 50% between adjacent frames. Fig. 3 shows the spectrograms generated for three different speakers for two iterations each. Fig. 3a and Fig. 3b are the spectrograms of speaker 1 for two iterations of the same sentence. Similarly, Fig. 3c and Fig. 3d are for speaker 2, and Fig. 3e and Fig. 3f are for speaker 3.

Figure 1. Speaker Identification System


Figure 2. Speech Signal and its Spectrogram: a. Speech signal, b. Spectrogram of frame size 256 with a 25% overlap, c. Spectrogram of frame size 256 with a 50% overlap

III. TRANSFORM TECHNIQUES
For the present work on Speaker Identification, we have used two transform techniques, namely the Haar transform and Kekre's Transform. The two transforms are explained in this section.
In the matrix form of the Haar transform, f is an N×N image, H is the N×N Haar transform matrix and F is the resulting N×N transformed image. The transformation H contains the Haar basis functions h_k(t), which are defined over the continuous closed interval t ∈ [0, 1]. The Haar basis functions are defined as follows.
When k = 0, the Haar function is defined as a constant:
$$h_0(t) = \frac{1}{\sqrt{N}} \qquad (2)$$
When k > 0, the Haar function is defined by
$$h_k(t) = \frac{1}{\sqrt{N}} \begin{cases} 2^{p/2}, & (q-1)/2^{p} \le t < (q-0.5)/2^{p} \\ -2^{p/2}, & (q-0.5)/2^{p} \le t < q/2^{p} \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$
where 0 ≤ p < log2 N, 1 ≤ q ≤ 2^p and k = 2^p + q - 1.

Figure 3. Spectrograms for three different speakers: a & b for speaker 1, c & d for speaker 2 and e & f for speaker 3, for the text "All great things are only a number of small things that have carefully been collected together".

Fig. 4 shows the eight waveforms generated using (3) for N = 8. Writing these in matrix form, we get the 8×8 Haar matrix.

Figure 6. Performance comparison of Haar and Kekre's Transform for an overlap of 25%
Figure 7. Performance comparison of Haar and Kekre's Transform for an overlap of 50%