Scalable Hybrid Speech Codec for Voice over Internet Protocol Applications

With the advent of various web-based applications and fourth generation (4G) access technology, there has been an exponential growth in the demand for multimedia service delivery alongside speech signals in a voice over internet protocol (VoIP) setup. There is therefore a need to fine-tune the conventional speech codecs deployed so that they suit the modern environment. This fine-tuning can be achieved by further compressing the speech signal and using the freed bandwidth to deliver other services. This paper presents a scalable hybrid speech codec model using ITU-T G.729 and the db10 wavelet. The codec addresses the problem of compressing the speech signal in a VoIP setup. The codec is compared with the standard codec through statistical analysis of the subjective, objective and quantifiable quality parameters desirable from a codec deployed on VoIP platforms.

Keywords: VoIP; Speech Compression; Hybrid Speech Codec; ITU-T G.729 codec; db10 wavelet; Statistical Analysis


I. INTRODUCTION
In recent years, there has been substantial standardization of the codecs and access technologies deployed for VoIP applications. This standardization has laid the foundation of modern 4G technology, a packet-switched network based on the internet protocol with data rates of up to 1 Gbps [1]. This complements a general characteristic of the VoIP setup, namely the ease of launching additional services alongside conventional telephony voice services [2]. The integration of different web-based applications into modern VoIP services has led to various Quality of Service (QoS) challenges for the VoIP application. These challenges include the requirement of high bandwidth, sensitivity to propagation delay and sensitivity to jitter [3]. To address these challenges, a highly efficient speech compression technique is desired. Compression conserves the precious resource of bandwidth and so enables other applications to be combined into the final VoIP package. The compression of the speech signal is carried out by the codecs deployed in the setup.
Currently, various ITU standard codecs are used in VoIP setups for seamless interconnectivity across different systems spanning continents. The data rates provided by these codecs result from engineering trade-offs between the quality of the voice signal, the complexity and the bandwidth of the codec [4].

Most commercial VoIP applications currently operate on the codec standardized by the International Telecommunication Union Telecommunication Standardization Sector (ITU-T) as ITU-T G.729, which is based on the principle of Conjugate Structure Algebraic Code Excited Linear Prediction (CS-ACELP) [5]. It is a hybrid codec that uses the algebraic sum of the fixed codebook gain vector and the adaptive gain to arrive at the actual codebook vector for performing the linear prediction [6]. The codec operates at 8 kbps and produces the highest quality of speech reproduction, known as "Toll Quality", under most practical and real-time conditions, and is hence widely deployed in voice-based operations [6]. The technology offers well-balanced complexity and suits multimedia digital simultaneous voice and data communication, which makes it well suited to VoIP applications [7]. The performance of this codec has been extensively tested in various environments, with the conclusion that it performs consistently with voice signals and narrowband control signals [8], with international and regional speech coding standards, in the presence of channel noise and over degraded transmission channels [9]. However, to accommodate additional services, further compression of the speech signal is envisaged.
In recent years, extensive research has been carried out on wavelet transforms and their applications in signal processing. A wavelet is defined as a short wave of finite duration whose average value is zero [10]. The wavelet transform represents the signal with very high precision and limited storage requirements [11]. Here the signal is decomposed into component signals that carry compressed information in both the frequency and time domains. The scaling function Φ(z) determines the resolution of the analysis, and the actual analysis is performed by the mother wavelet function Ψ(z) [12]. The wavelet transform of the signal s(z) is defined in [13], where s(z) is the original signal, p is the scaling parameter and q is the translation parameter. The signal is expressed through the wavelet function in terms of coefficients, where C_t are the average coefficients and d_{j,t} are the detail coefficients. The signal generated from these wavelet coefficients, as referred to in equation 5, provides a compressed representation of the original signal.
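For reference, the standard textbook forms of these expressions, written with the symbols defined in this section, can be sketched as follows. The normalization factor, the index conventions and the subscripted basis functions are assumptions and may differ from the forms used in [13].

```latex
% Continuous wavelet transform of s(z), with scale p > 0 and translation q
W(p, q) = \frac{1}{\sqrt{p}} \int_{-\infty}^{\infty} s(z)\,
          \Psi^{*}\!\left(\frac{z - q}{p}\right) dz

% Scaled and translated version of the mother wavelet
\Psi_{p,q}(z) = \frac{1}{\sqrt{p}}\, \Psi\!\left(\frac{z - q}{p}\right)

% Expansion of the signal in average coefficients C_t and detail coefficients d_{j,t}
s(z) = \sum_{t} C_{t}\, \Phi_{t}(z) + \sum_{j} \sum_{t} d_{j,t}\, \Psi_{j,t}(z)
```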
The codec we propose aims to enhance the performance of the CS-ACELP codec. It combines the concept of CS-ACELP with that of the wavelet transform, thereby pairing additional compression with a standardized core for deployment in VoIP setups.
The paper is organized into six sections. Section I introduces the requirement for further compression of the speech signal in modern VoIP applications. Section II provides a brief background of speech signal compression using wavelets. The concept of the proposed scalable hybrid speech codec is presented in Section III. Section IV defines the performance evaluation parameters. The results of the assessments are presented in Section V. Section VI presents the conclusion.

II. SPEECH SIGNAL COMPRESSION USING WAVELETS
The wavelet transform is used to study the temporal and spectral properties of non-stationary signals such as speech and audio, based on the frequency-time multi-resolution property of wavelets [15]. The mother wavelet performs the analysis of the speech signal in the wavelet domain. The general criteria for selecting a wavelet family for compression are [16]: a) minimal support in the frequency as well as the time domain, and b) a relatively large number of vanishing moments. The greater the number of vanishing moments, the quicker the coefficients decay, which leads to a compact signal representation [17]. Such wavelets also provide a higher quality of reconstructed signal, less distortion and a high degree of compression, at the cost of higher operational complexity [18].
The selection of the mother wavelet for the analysis is governed by the quality parameters required in the synthesized signal [19]. In VoIP applications, the desirable qualities of the speech signal are naturalness, intelligibility, pleasantness and the possibility of recognizing the speaker [20]. The desirable properties of the codecs deployed for such applications include [21][22]: operation at a low bit rate, low complexity, robustness across different speakers and languages, and robust performance in the presence of channel errors. Recent research in signal processing has found that the Daubechies family of wavelets provides better results in the recognition of spoken digits than other wavelets [15,23,24,25], and has the advantages of approximate shift-invariance and better edge representation compared with the real-valued discrete wavelet transform [26][27]. Hence, the Daubechies family of wavelets is used in the proposed codec.
The Daubechies wavelet family is extensively used for the analysis of speech signals. Its members are orthogonal wavelets with the highest number of vanishing moments for a given support. The scaling and wavelet functions have no explicit closed-form expression [28]. The family is denoted dbN, where db stands for the family name Daubechies; in the convention followed here, N represents the number of coefficients generated and the number of vanishing moments is N/2 [12,29]. Some software, including MATLAB, instead uses N in dbN for the number of vanishing moments, with 2N filter coefficients.
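The link between filter coefficients and vanishing moments can be checked numerically. The sketch below is an illustration only: it uses the 4-coefficient Daubechies filter, whose coefficients have a simple closed form, rather than db10, and the helper name `discrete_moment` is hypothetical. With 4 coefficients, the first two discrete moments of the wavelet filter vanish and the third does not, which matches the N/2 relation stated above.

```python
import math

# Daubechies scaling (low-pass) filter with 4 coefficients.
# NOTE: db10 has a much longer filter; the 4-tap case is used here only
# because its coefficients can be written in closed form.
SQRT3 = math.sqrt(3.0)
NORM = 4.0 * math.sqrt(2.0)
low_pass = [(1 + SQRT3) / NORM, (3 + SQRT3) / NORM,
            (3 - SQRT3) / NORM, (1 - SQRT3) / NORM]

# Wavelet (high-pass) filter from the quadrature-mirror relation.
n_taps = len(low_pass)
high_pass = [(-1) ** k * low_pass[n_taps - 1 - k] for k in range(n_taps)]


def discrete_moment(filt, order):
    """Return sum_k k**order * filt[k], the discrete moment of a filter."""
    return sum((k ** order) * c for k, c in enumerate(filt))


# Moments of order 0, 1 and 2 of the wavelet filter.
moments = [discrete_moment(high_pass, m) for m in range(3)]
print(moments)
```

The zeroth and first moments are zero up to rounding, so constant and linear trends in the signal produce no detail coefficients; this is what makes the representation compact.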
The wavelet transform decomposes the signal into average (scaling) coefficients, which represent the low-frequency components, and detail (wavelet) coefficients, which contain the high-frequency components. In a speech signal, high frequencies are present at the onset for a very brief period, while lower frequencies are present later for longer periods [30]. Wavelet transforms resolve all these frequencies simultaneously, localized in time to a level proportional to their wavelength, thereby obtaining localization in time as well as in frequency. Recent studies of wavelet decomposition [31] indicate that about 90% of the wavelet coefficients have a magnitude below 5% of the maximum value and can hence be treated as redundant for the purpose of signal analysis. Compression is therefore achieved by truncating these redundant, low-valued coefficients to zero and reconstructing the signal from the remaining coefficients [17]. The speech compression technique using wavelets is illustrated in Fig. 1.
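The truncate-and-reconstruct idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the codec described in this paper: it substitutes the Haar wavelet for db10 to keep the filters short, uses a synthetic sine wave instead of speech, and the function names and the 5% cut-off parameter are chosen for illustration.

```python
import math

S = 1.0 / math.sqrt(2.0)


def haar_dwt(x):
    """One Haar analysis step: return (average, detail) coefficients."""
    avg = [(x[i] + x[i + 1]) * S for i in range(0, len(x), 2)]
    det = [(x[i] - x[i + 1]) * S for i in range(0, len(x), 2)]
    return avg, det


def haar_idwt(avg, det):
    """Invert one Haar analysis step."""
    out = []
    for a, d in zip(avg, det):
        out.extend(((a + d) * S, (a - d) * S))
    return out


def compress(x, levels, fraction):
    """Decompose, zero coefficients below fraction * max magnitude, rebuild.

    len(x) must be divisible by 2**levels.
    Returns (reconstructed signal, number of zeroed coefficients, total).
    """
    avg, details = list(x), []
    for _ in range(levels):
        avg, det = haar_dwt(avg)
        details.append(det)
    coeffs = avg + [c for det in details for c in det]
    limit = fraction * max(abs(c) for c in coeffs)
    zeroed = sum(1 for c in coeffs if abs(c) < limit)
    avg = [c if abs(c) >= limit else 0.0 for c in avg]
    details = [[c if abs(c) >= limit else 0.0 for c in det] for det in details]
    for det in reversed(details):  # coarsest level first
        avg = haar_idwt(avg, det)
    return avg, zeroed, len(coeffs)


# Synthetic test signal (assumption): 5 cycles of a sine over 256 samples.
signal = [math.sin(2 * math.pi * 5 * n / 256) for n in range(256)]
reconstructed, zeroed, total = compress(signal, levels=5, fraction=0.05)
print(f"{zeroed} of {total} coefficients set to zero")
```

Raising `fraction` discards more coefficients and so trades reconstruction accuracy for compression; with `fraction=0.0` the transform is lossless.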

III. PROPOSED SCALABLE HYBRID SPEECH CODEC
The proposed codec uses the Daubechies family of wavelets, specifically the db10 wavelet, for its wavelet analysis part. The work flow diagram of the proposed codec is presented in Fig. 2. To reduce the bandwidth required by the standard G.729 codec, a scalable hybrid codec is proposed in which the output of the standard codec is cascaded into a wavelet-based compression stage to obtain the final compressed reconstructed signal.
The CS-ACELP codec acts as the core layer for speech signal processing, and the wavelet compression acts as the enhancement layer that provides the required compression. The original signal is fed to a conventional ITU-T G.729 codec to obtain an 8 kbps synthesized speech signal. This signal is in turn subjected to a 5-level recursive wavelet decomposition, since no added advantage is available for signal processing beyond the fifth level of decomposition [11]. We have chosen the Daubechies family of wavelets for the decomposition because they concentrate more than 96% of the signal energy in their approximation coefficients [31]. Further, as described in the previous section, the Daubechies family provides better results in speech recognition, better edge representation for speaker identification and approximate shift-invariance, which suits VoIP speech analysis.
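How much energy ends up in the approximation coefficients after a 5-level recursive decomposition can be measured directly. The sketch below is illustrative only: it uses the Haar wavelet in place of db10 and a slowly varying synthetic sine in place of G.729 output, and the helper names are hypothetical. The share it prints applies to this toy signal and says nothing about the 96% figure quoted from [31].

```python
import math

S = 1.0 / math.sqrt(2.0)


def haar_step(x):
    """One Haar analysis step: return (average, detail) coefficients."""
    avg = [(x[i] + x[i + 1]) * S for i in range(0, len(x), 2)]
    det = [(x[i] - x[i + 1]) * S for i in range(0, len(x), 2)]
    return avg, det


def wavedec(x, levels):
    """Recursive decomposition; returns [cA_L, cD_L, ..., cD_1]."""
    avg, details = list(x), []
    for _ in range(levels):
        avg, det = haar_step(avg)  # only the averages are split again
        details.append(det)
    return [avg] + details[::-1]


def energy(coeffs):
    """Sum of squared coefficient values."""
    return sum(v * v for v in coeffs)


# Synthetic low-frequency signal (assumption): one sine cycle, 320 samples.
signal = [math.sin(2 * math.pi * n / 320) for n in range(320)]
bands = wavedec(signal, levels=5)
total = sum(energy(b) for b in bands)
approx_share = energy(bands[0]) / total
print(f"approximation band holds {100 * approx_share:.1f}% of the energy")
```

Because the transform is orthonormal, the band energies add up to the energy of the input, so the share is a well-defined fraction between 0 and 1.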
The decomposed signal is subjected to thresholding in order to truncate and de-noise it, thereby improving the Signal to Noise Ratio (SNR) [32]. Adaptive or soft thresholding is used in the proposed codec, as it provides an efficient method of de-noising that depends on the signal under study [33].
The thresholding procedure determines the degree of compression of the signal [34]. The input signal is approximated from the truncated coefficients by applying the inverse wavelet transform. The signal thus synthesized is the compressed version of the original speech signal.
As can be observed from the work flow diagram, the complexity of the proposed codec is somewhat higher than that of the standard codec because of the wavelet transform calculations. The threshold used for compression is selected by the minimax principle from statistics.
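Soft thresholding with a minimax threshold can be sketched as follows. The constants 0.3936 and 0.1829 are the commonly used approximation to the Donoho-Johnstone minimax threshold for unit noise level; whether the simulation in this paper used exactly these values is an assumption, and the function names are hypothetical.

```python
import math


def minimax_threshold(n):
    """Approximate minimax threshold for n coefficients, unit noise level."""
    if n <= 32:
        return 0.0
    return 0.3936 + 0.1829 * math.log2(n)


def noise_sigma(detail):
    """Robust noise estimate: median absolute coefficient / 0.6745."""
    mags = sorted(abs(d) for d in detail)
    mid = len(mags) // 2
    if len(mags) % 2:
        median = mags[mid]
    else:
        median = 0.5 * (mags[mid - 1] + mags[mid])
    return median / 0.6745


def soft_threshold(coeffs, thr):
    """Shrink each coefficient towards zero by thr; small ones become zero."""
    return [math.copysign(max(abs(c) - thr, 0.0), c) for c in coeffs]


# Hypothetical detail coefficients, for illustration only.
detail = [3.0, -0.5, -2.0, 0.2]
shrunk = soft_threshold(detail, 1.0)
print(shrunk)
```

In a typical use, the threshold applied to a set of detail coefficients would be `noise_sigma(detail) * minimax_threshold(len(detail))`, so the cut-off scales with the estimated noise level of the signal under study.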

IV. PERFORMANCE EVALUATION PARAMETERS
Subjective and objective tests are carried out on a standalone ITU-T G.729 CS-ACELP based speech codec and on the proposed codec to compare their performance. The tests use eight speech samples from male and female speakers in the English and Hindi languages. The speakers are of Indian ethnicity and in the age group of 30-35 years. The Mean Opinion Score (MOS) is calculated for the subjective evaluation of the speech codec. In the MOS test, the original and the reconstructed signals are presented to a listener, who then provides an acceptability grade between 1 and 5, where 5 is the excellent grade [35]. The general rating of CS-ACELP is in the range of 4.1-4.5 [36].
The objective performance of the codecs is assessed by measuring the Perceptual Objective Listening Quality Assessment (POLQA) score, the Compression Ratio (CR), the Normalized Root Mean Square Error (NRMSE), the SNR and the total response time taken by the algorithm to process the speech sample. POLQA [37] is an ITU-T objective speech quality measurement method described in Recommendation P.863. It analyzes the original and reconstructed signals sample by sample in the frequency domain, after temporally aligning them, to provide the MOS mapping [38]. The mathematical expressions for CR, SNR and NRMSE are given below [39-40].
Compression Ratio is defined as:
$$ CR = \frac{\mathrm{Length}\big(c(k)\big)}{\mathrm{Length}\big(m(k)\big)} $$
where c(k) is the original signal and m(k) is the reconstructed signal.
Signal to Noise Ratio is defined as:
$$ SNR = 10 \log_{10}\left(\frac{\sigma_c^{2}}{\sigma_e^{2}}\right) $$
where $\sigma_c^{2}$ is the mean square of the original signal and $\sigma_e^{2}$ is the mean square difference between the original and synthetic signals.
Normalized Root Mean Square Error (NRMSE) is defined as:
$$ NRMSE = \sqrt{\frac{\sum_{n} \big(c(n) - m(n)\big)^{2}}{\sum_{n} \big(c(n) - \mu_c\big)^{2}}} $$
where c(n) is the original signal, m(n) is the synthetic signal and $\mu_c$ is the mean of the original signal.
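The objective measures defined in this section can be computed along the following lines. This is a sketch under assumptions: CR is taken as a ratio of lengths, SNR as ten times the base-10 logarithm of the power ratio, and NRMSE as the root of the error energy normalized by the variation of the original about its mean. The function names and the sample values are hypothetical.

```python
import math


def compression_ratio(original_length, compressed_length):
    """Ratio of original length to compressed length (assumed form of CR)."""
    return original_length / compressed_length


def snr_db(original, reconstructed):
    """SNR in dB: mean square of the original over mean square error."""
    n = len(original)
    signal_power = sum(c * c for c in original) / n
    error_power = sum((c - m) ** 2 for c, m in zip(original, reconstructed)) / n
    return 10.0 * math.log10(signal_power / error_power)


def nrmse(original, reconstructed):
    """Root of error energy normalized by variation about the original mean."""
    mean = sum(original) / len(original)
    num = sum((c - m) ** 2 for c, m in zip(original, reconstructed))
    den = sum((c - mean) ** 2 for c in original)
    return math.sqrt(num / den)


# Toy signals for illustration only; these are not data from the experiment.
original = [1.0, -1.0, 1.0, -1.0]
reconstructed = [0.9, -0.9, 0.9, -0.9]
print(snr_db(original, reconstructed), nrmse(original, reconstructed))
```

A higher SNR and a lower NRMSE both indicate a reconstruction closer to the original, so the two measures should move in opposite directions when the threshold is changed.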
The bit rate in kbps is determined as [13]:
$$ \text{Bit rate (kbps)} = \frac{\text{Size of the reconstructed signal file in kilobits}}{\text{Duration of the signal in seconds}} $$
The quality of the speech codec is evaluated by measuring the MOS, POLQA, Compression Ratio, SNR and NRMSE for the individual samples. The total response time to process the speech signals is also presented for analysis. These evaluations are carried out on an Intel i5 processor with 4 GB RAM.
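The bit-rate calculation is simple arithmetic: file size in kilobits divided by duration in seconds. The sketch below assumes that the file size is available in bytes and that one kilobit is 1000 bits; the function name and the sample values are hypothetical and are not measurements from this work.

```python
def bit_rate_kbps(file_size_bytes, duration_s):
    """Bit rate in kbps from a file size in bytes and a duration in seconds."""
    kilobits = file_size_bytes * 8 / 1000.0  # assumes 1 kilobit = 1000 bits
    return kilobits / duration_s


# Hypothetical example: a 3 s clip stored in 1500 bytes.
print(bit_rate_kbps(1500, 3.0))
```

With these conventions, a 3 s clip occupying 3000 bytes corresponds to 8 kbps and one occupying 1500 bytes corresponds to 4 kbps.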

V. RESULTS
The operational bit rate observed in the MATLAB simulation of the proposed codec, using the minimax threshold, is 4 kbps. The details of the observations are presented in Table II. The individual results are provided in the following figures. Fig. 3 compares the codecs in terms of MOS, Fig. 4 in terms of Compression Ratio, Fig. 5 in terms of SNR, Fig. 6 in terms of NRMSE, Fig. 7 in terms of POLQA and Fig. 8 in terms of delay response. Fig. 9 shows the original signal and the reconstructed signal output of the proposed codec. The results are summarized in Table III.

VI. CONCLUSION
This paper compares the performance of the CS-ACELP based standard codec (ITU-T G.729) with that of the proposed scalable hybrid codec. The proposed codec operates at a bit rate of 4 kbps and provides a greater degree of compression than the standard codec. The compression can be fine-tuned further, to incorporate additional applications in the final service delivery, by altering the threshold selected for truncating the wavelet coefficients. The proposed codec introduces an additional delay of about 5 ms, which is attributed to the higher complexity of calculating the wavelet transform. It provides comparable results in the other parameters under test and robust performance across speakers and languages.
The additional delay of 5 ms may be treated as acceptable in real-time applications, considering that the codec operates at a lower bit rate than the standard ITU-T G.729 codec while providing comparable performance in terms of MOS, SNR, etc.
Based on the above, it can be concluded that the proposed codec provides a viable alternative to the codecs currently deployed in VoIP setups, with the additional feature of a high degree of compression as desired in future VoIP deployments.

Figure 1. Speech Signal Compression Using Wavelets

The ITU-T G.729 (CS-ACELP) based codec and the proposed scalable hybrid codec, based on CS-ACELP and a Daubechies family wavelet, are simulated in MATLAB. The test samples listed in Table I are run through both codecs and the individual performance is noted.

Figure 9. Speech Signal Compression Using Proposed Codec

TABLE I. DETAILS OF SENTENCES USED IN THE EXPERIMENT

TABLE II. SUMMARIZATION OF THE BIT RATES OBSERVED USING THE CODECS