A Unique Glottal Flow Parameters based Features for Anti-spoofing Countermeasures in Automatic Speaker Verification

The domain of Automatic Speaker Verification (ASV) is advancing rapidly with developments in feature engineering and artificial intelligence. In spite of this, ASV systems remain vulnerable to spoofing attacks in the form of synthetic or replayed speech. Synthetic speech is difficult to detect because recent advances in Voice Conversion (VC) and Text-to-Speech (TTS) systems produce natural speech that is nearly indistinguishable from genuine speech. To prevent such attacks, robust spoof detection systems are needed. To this end, we propose estimating Glottal Flow Parameters (GFP) from genuine and synthetic spoofed speech samples. The GFP are further parameterized using time-domain, frequency-domain and Liljencrants–Fant (LF) model parameters. Along with the GFP features, Linear Frequency Cepstral Coefficients (LFCC) and statistical parameters are computed. The GFP features are investigated to assess their usefulness in discriminating spoofed from genuine speech. The framework is tested on the ASVspoof 2019 corpus and evaluated against the baseline models. The proposed spoof detection framework yields an Equal Error Rate (EER) of 2.39% and a tandem Detection Cost Function (t-DCF) of 0.0562, which is better than the state-of-the-art technique.

Keywords: Spoof detection; synthetic speech; glottal excitation; speaker verification; voice conversion; text-to-speech


I. INTRODUCTION
A speaker verification system accepts the claimed identity of a known speaker while rejecting the voice of an unknown speaker [1]. These systems are exposed to intruders through spoofing attacks. An intrusion in the form of synthetically generated speech constitutes a spoofing attack on the ASV system. Such a setting is termed the Logical Access (LA) scenario, while one involving replayed speech is the Physical Access (PA) scenario [2]. These attacks are a by-product of continuous research in Voice Conversion (VC) and Text-to-Speech (TTS) [3], whose aim is to generate clean, human-like speech that differs little or not at all from natural speech. Hence, tackling these attacks by means of efficient features and machine learning algorithms is essential. Studies on anti-spoofing countermeasures have increased considerably with the rise in attacks on mainstream systems where speech is used as the identity, such as phone-banking theft and unauthorized access to workplaces or smartphones [3], [4]. As authentication is no longer limited to fingerprints and retina scans, speech-based spoofing attacks are growing and are attracting the attention of many researchers developing robust spoofing detection schemes. Moreover, the countermeasures developed so far are less than a decade old and still have scope for improvement in terms of reducing the False Acceptance Ratio. Most of the research targets a specific type of attack [5], [6], while a few works consider all types of attack, making them universal detectors [7], [8].

II. RELATED WORK
Anti-spoofing measures depend on two prime techniques: feature representation and spoofed speech classification. The choice of features is significant and must reflect the nature of the input speech, which is either genuine or spoofed. The task is thus to differentiate between spoofed and genuine speech by using features that extract relevant information from the test speech. The spectral features employed for spoofing detection include Mel-Frequency Cepstral Coefficients (MFCC) [9], [10] and magnitude- and phase-based features [11] such as Log Magnitude Spectrum, Residual Log Magnitude Spectrum, Group Delay (GD), Modified GD (MGD), Instantaneous Frequency (IF), Baseband Phase Difference and Pitch Synchronous Phase (PSP). Although MFCCs model the human auditory system through perceptually motivated filter bank analysis, they are found to perform poorly in the anti-spoofing setting [11]. To counter this, the Inverse MFCC (IMFCC) was proposed for spoof detection because it captures feature content that is absent in MFCC [12]. Furthermore, CFCCIF and CQCC based features were also proposed, of which CQCCs are considered to have performed best in the ASVspoof 2017 challenge [13], [14].
In speech production, the lungs act as a source of air that drives excitation at the glottis, producing resonant frequencies that travel through the vocal tract and out of the mouth. Lip radiation is also part of the production mechanism but is stable. Analytically, the information available in speech comprises the meaning of the utterance and the individual speaker's identity. For designing a countermeasure to detect an attack, extracting speaker-related information and the artefacts introduced by synthetic speech is a crucial step. Both the identity of the speaker and the meaning of the sample can be interpreted at different parts of the production mechanism, such as the shape of the Vocal Tract (VT), the nature of the Glottal Excitation (GE) or flow, and prosody parameters [28]. This work is based on analysing the source of the speech production model, i.e. on glottal source estimation. The research in [20] used Iterative Adaptive Inverse Filtering (IAIF) for glottal flow estimation but focused more on the classifiers (SVM and ELM). In addition, we consider the VT information, which captures the speaker's individuality, in the form of LFCC [29] with statistical parameters. A few studies have shown glottal excitation to be independent of the VT [30], while others have shown inter-dependency between them [31], [32], [33]. Hence, we found it necessary to explore the glottal excitation components of genuine and spoofed speech. Furthermore, the scope of the research is confined to LA attacks, as synthetic speech production is becoming more accessible and more natural. This is because open-source tools and datasets are available to researchers, leading to more versatile synthetic speech generators [5], [23], [34], [35].
Thus, the research approach is divided into a three-fold process, listed as follows:

1) Exploring the Glottal Flow Parameters (GFP) using Quasi-Closed Phase estimation and LF modelling to capture the inaudible artefacts present in the synthetic speech through careful representation of source excitation process.
2) Investigating the performance of these GFP features using objective metrics in the GMM framework.
3) Conducting comparative analysis of the proposed features with the Baseline LFCC features [2].

The article is organized as follows: Section III describes the glottal excitation estimation based feature extraction, while Section IV elaborates the proposed anti-spoofing speaker verification system. Section V presents the experimental results, while the overall discussion and conclusion are given in Sections VI and VII, respectively.

III. GLOTTAL EXCITATION ESTIMATION BASED FEATURE EXTRACTION
Estimating the source of speech by filtering out the effects of lip radiation and the vocal tract is termed Glottal Inverse Filtering (GIF). Research on glottal source estimation began in the 1950s with Miller [36]. Since then, the representation of the glottal source has improved, but it remains difficult to compute because of the lack of ground truth, for example when no electroglottography (EGG) information is available. Some studies have therefore worked with synthetic speech in order to avoid the need for ground truth [37]. In the spoof detection task, this research analyses natural as well as synthetic (i.e. spoofed) speech. GIF analysis was initially based on closed phase, iterative and adaptive approaches [38]. Closed phase estimation applies the covariance criterion for Linear Prediction (LP) analysis to the samples that lie in the closed phase. Another approach, which requires prior knowledge of the shapes of both the vocal tract and the glottal excitation, is Iterative Adaptive Inverse Filtering (IAIF) [31]. Mixed-phase approaches such as Complex Cepstrum analysis [39] and Zeros of the Z-Transform (ZZT) [40] contrast with the earlier estimation techniques, as they separate glottal and vocal tract information by transformation into another domain (such as the frequency or z-domain). Furthermore, the Mean-Square Phase (MSP) is used to approximate the Liljencrants-Fant (LF) model [41]. Most of the approaches mentioned so far perform well for low-pitched male voices and deteriorate at higher fundamental frequencies (f0) [38]. This research is based on Quasi-Closed Phase (QCP) glottal estimation, which uses Weighted Linear Prediction (WLP) in place of the covariance criterion, as shown in Fig. 1. This kind of estimation is found to be more robust in the closed phase parts of the speech samples [38].
Moreover, studies so far have concentrated on the VT content of speech, whereas the glottal excitation is equally important as it is the source of the speech production system.
The speech signal $s_m$ is produced by convolution in the time domain, which in the z-domain becomes the product of the individual frequency responses of the GE source $G(z)$ and the VT filter $T(z)$. Thus, the speech signal $S(z)$ in the z-domain is given in Equation 1:

$$S(z) = G(z)\,T(z) \tag{1}$$

Using the conventional LP approach, the WLP model for the $m$-th speech sample is as shown in Equation 2:

$$s_m = \sum_{j=1}^{L} b_j\, s_{m-j} + e_m \tag{2}$$

where $e_m$ is the excitation signal and $b_j$ is the $j$-th prediction coefficient of order $L$. The significant difference between WLP and LP analysis is that WLP multiplies the square of the excitation signal by a weight function $W_m$, which gives the total residual energy $E$ in Equation 3:

$$E = \sum_{m=m_1}^{m_2} W_m\, e_m^2 \tag{3}$$

For the autocorrelation criterion, the limits are $m_1 = 1$ and $m_2 = M + L$, where $M$ is the length of the frame. The weight function $W_m$ is given in Equation 4 using the Attenuated Main Excitation (AME) function.
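The WLP minimisation described above can be sketched numerically. The following is a minimal illustration, not the authors' implementation: it assumes NumPy, solves the weighted normal equations for the autocorrelation limits (m = 1 .. M + L, with the frame zero-extended), and uses a hypothetical AME-like weight that simply attenuates the rows at the excitation instants. The function name, the toy all-pole signal, the attenuation value 1e-5 and the extra attenuation of the zero-padded tail rows are all assumptions made for the example.

```python
import numpy as np


def weighted_linear_prediction(frame, weights, order):
    """Solve for b minimising E = sum_m W_m * e_m**2 (autocorrelation limits)."""
    frame = np.asarray(frame, dtype=float)
    weights = np.asarray(weights, dtype=float)
    n = len(frame)
    if len(weights) != n + order:
        raise ValueError("weights must cover m = 1 .. M + L")
    # Zero-extend the frame so that s_m = 0 outside 1 .. M.
    padded = np.concatenate([np.zeros(order), frame, np.zeros(order)])
    rows = np.arange(order, n + 2 * order)  # one row per m = 1 .. M + L
    # Column j-1 holds the delayed samples s_{m-j}.
    delayed = np.stack([padded[rows - j] for j in range(1, order + 1)], axis=1)
    target = padded[rows]
    # Weighted normal equations: (D^T W D) b = D^T W s.
    lhs = delayed.T @ (weights[:, None] * delayed)
    rhs = delayed.T @ (weights * target)
    return np.linalg.solve(lhs, rhs)


# Toy frame: a 2nd-order all-pole filter driven by an impulse train.
true_b = np.array([1.5, -0.7])
M, order, period = 200, 2, 10
excitation = np.zeros(M)
excitation[::period] = 1.0
frame = np.zeros(M)
for k in range(M):
    frame[k] = excitation[k]
    if k >= 1:
        frame[k] += true_b[0] * frame[k - 1]
    if k >= 2:
        frame[k] += true_b[1] * frame[k - 2]

# Hypothetical AME-like weights: attenuate the rows at the excitation instants
# and (an extra assumption for this toy case) the zero-padded tail rows.
ame_like = np.ones(M + order)
ame_like[np.flatnonzero(excitation)] = 1e-5
ame_like[M:] = 1e-5

b_wlp = weighted_linear_prediction(frame, ame_like, order)
b_lp = weighted_linear_prediction(frame, np.ones(M + order), order)  # plain LP
err_wlp = np.max(np.abs(b_wlp - true_b))
err_lp = np.max(np.abs(b_lp - true_b))
print("WLP:", b_wlp, "LP:", b_lp)
```

With uniform weights the routine reduces to ordinary LP, so the two calls differ only in the weight vector. In this toy case, down-weighting the excitation instants brings the estimated coefficients closer to the filter that generated the signal, which is the intuition behind using WLP for glottal inverse filtering.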
The glottal flow waveforms obtained from the raw speech samples of genuine speech, TTS synthetic speech and VC speech are shown in Fig. 2a, Fig. 2b and Fig. 2c, respectively. The time-domain parameters of the glottal flow are based on the Open Quotient (OQ), Speed Quotient (SQ) and Closing Quotient (ClQ), while the amplitude parameters are based on the Amplitude Quotient (AQ). Lastly, frequency-domain parameters, namely the Parabolic spectrum parameter (Psp), the difference between the amplitudes of the first and second harmonics (H1-H2) and the Harmonic Richness Factor (HRF), which are adapted from [37], are also computed as part of the GFP features.

IV. PROPOSED ANTI-SPOOFING SPEAKER VERIFICATION FRAMEWORK
A spoof detection (anti-spoofing) algorithm must be designed by carefully choosing features that represent spoofed and genuine speech well, so that the differentiation task becomes easier. The choice of an appropriate classifier is equally crucial. The spoof detection system has two primary phases, namely the training phase and the testing phase, as shown in Fig. 3. The training phase involves extracting the GFP, LFCC and statistical features after pre-processing the raw speech data. These features are fed to the Gaussian Mixture Model (GMM) classifier along with their associated labels. The individual models for genuine and spoofed samples are then used in the testing phase to categorize the unknown test sample. The detailed steps, namely parameterization, model training and the decision-making algorithm, are described in the following subsections.

A. Parameterization
During the training stage, the speech samples are low-pass filtered and divided into frames. Silence and pauses at the beginning and end of each sample are removed using voice activity detection [20]. The VT filter information is represented using LFCC with a 20 ms frame size. Furthermore, statistical parameters such as the mean, coefficient of variation (CoV) and interquartile range (IQR) are combined with the LFCC parameters to form a feature matrix. The LFCC features are found to be more robust than MFCC for noisy speech, as they perform well in the higher frequency region (which carries VT features). The order of the LFCC is 19, and its delta and double-delta variants are also computed. The VT filter features alone are not sufficient to represent the speech, especially when naturally spoken speech needs to be differentiated from spoofed speech. In line with the speech production model, the remaining glottal excitation information is represented by GFP estimated through the QCP GIF technique with a 30 ms frame length. The time-based parameters OQ, SQ and ClQ are computed as in Equation 5:

$$OQ = \frac{L_{01} + L_c}{L}, \qquad SQ = \frac{L_{01}}{L_c}, \qquad ClQ = \frac{L_c}{L} \tag{5}$$

where $L_{01}$ is the opening phase length (expressed in time), $L_c$ is the closing phase length and $L$ is the glottal cycle length, i.e. the period. The AQ is computed using Equation 6:

$$AQ = \frac{A_{max}}{|d_{min}|} \tag{6}$$

where $A_{max}$ is the glottal flow peak and $d_{min}$ is the minimum value of the derivative of the glottal flow waveform. The Normalized AQ (NAQ) is given by Equation 7:

$$NAQ = \frac{AQ}{L} \tag{7}$$

Apart from these, the Quasi OQ (QOQ), HRF, Psp and H1-H2 are also used as part of the GFP. Additionally, the LF model parameters $E_e$, $R_a$, $R_g$ and $R_k$ are considered, as they depend on the linear source-filter model (Table I gives the details of the parameters). To assess the discriminative ability of the GFP descriptors, a subjective test is performed (Fig. 4): box plots display the numerical values of the genuine and spoofed speech samples for AQ, QOQ, HRF and H1-H2.
From Fig. 4, the IQRs of genuine and spoofed speech differ for AQ and QOQ, whereas for H1-H2 and HRF the IQR values of genuine and spoofed speech are fairly similar. Hence, AQ and QOQ have higher discriminative power than H1-H2 and HRF.
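As a worked illustration of the time- and amplitude-based quotients used in this subsection, the sketch below computes OQ, SQ, ClQ, AQ and NAQ for a single glottal cycle. It assumes the commonly used definitions (OQ as the open portion of the cycle, SQ as the opening-to-closing ratio, ClQ as the closing portion, AQ as peak flow over the magnitude of the steepest negative slope, and NAQ as AQ divided by the period). The function name and the numerical values of the example cycle are hypothetical.

```python
def glottal_quotients(opening_len, closing_len, cycle_len, peak_flow, min_derivative):
    """Time- and amplitude-based quotients for a single glottal cycle."""
    oq = (opening_len + closing_len) / cycle_len  # open quotient
    sq = opening_len / closing_len                # speed quotient
    clq = closing_len / cycle_len                 # closing quotient
    aq = peak_flow / abs(min_derivative)          # amplitude quotient (time units)
    naq = aq / cycle_len                          # normalized amplitude quotient
    return {"OQ": oq, "SQ": sq, "ClQ": clq, "AQ": aq, "NAQ": naq}


# Hypothetical cycle: 4 ms opening, 2 ms closing, 10 ms period,
# unit peak flow and a steepest closing slope of -500 flow units per second.
q = glottal_quotients(0.004, 0.002, 0.010, 1.0, -500.0)
for name, value in q.items():
    print(f"{name}: {value:.4g}")
```

Because NAQ divides AQ by the cycle length, it is dimensionless and less sensitive to the fundamental frequency than AQ itself.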

B. Model Training and Decision Making Algorithm
The GFP parameters, LFCC features and statistical parameters together form a feature matrix for each sample, computed separately for the spoofed and genuine categories. In this study, we use a GMM-based binary classifier with 512 mixtures to model the two classes, genuine and spoofed speech. The GMM for genuine speech samples is denoted $\lambda_{gen}$, while that for spoofed samples is $\lambda_{sf}$. GMMs are considered to achieve higher classification accuracy owing to their ability to generalize to unknown data samples. For a particular test utterance $T$, the log-likelihood ratio $R$ is given in Equation 8:

$$R = \log p(T \mid \lambda_{gen}) - \log p(T \mid \lambda_{sf}) \tag{8}$$

where the likelihood scores obtained from the GMMs for genuine and spoofed speech samples are $s_{gen} = \log p(T \mid \lambda_{gen})$ and $s_{sf} = \log p(T \mid \lambda_{sf})$, respectively.
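The scoring rule above can be illustrated with a small, self-contained sketch. It is not the paper's 512-mixture system: the two toy diagonal-covariance GMMs, their parameters and the function names are hypothetical, and averaging the frame log-likelihoods over the utterance is an assumption made here for illustration.

```python
import math


def gmm_frame_loglik(x, weights, means, variances):
    """log p(x | lambda) for one frame under a diagonal-covariance GMM."""
    component_ll = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            ll += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        component_ll.append(ll)
    # Log-sum-exp over the mixture components for numerical stability.
    top = max(component_ll)
    return top + math.log(sum(math.exp(c - top) for c in component_ll))


def llr_score(frames, gmm_gen, gmm_sf):
    """R = s_gen - s_sf, with frame log-likelihoods averaged (an assumption)."""
    s_gen = sum(gmm_frame_loglik(x, *gmm_gen) for x in frames) / len(frames)
    s_sf = sum(gmm_frame_loglik(x, *gmm_sf) for x in frames) / len(frames)
    return s_gen - s_sf


# Hypothetical 2-component, 2-dimensional models: (weights, means, variances).
gmm_gen = ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
gmm_sf = ([0.5, 0.5], [[4.0, 4.0], [5.0, 5.0]], [[1.0, 1.0], [1.0, 1.0]])

genuine_like = [[0.2, 0.1], [0.9, 1.1]]
spoof_like = [[4.1, 3.9], [5.2, 4.8]]
r_genuine = llr_score(genuine_like, gmm_gen, gmm_sf)
r_spoof = llr_score(spoof_like, gmm_gen, gmm_sf)
print(r_genuine, r_spoof)
```

A positive R favours the genuine model and a negative R favours the spoof model; the decision threshold applied to R is left to the evaluation stage.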

V. EXPERIMENTAL RESULTS
The research is based on the ASVspoof 2019 dataset [42], which was part of the ASVspoof challenge held in 2019. The corpus consists of 20 speakers and more than fifty thousand samples for the LA attack. For training we used 2580 genuine and 22800 spoofed samples, while 23400 samples are used for development, as shown in Table II. So far, this is the only dataset with such a wide variety of samples and attack types. The state-of-the-art LFCC-GMM technique is considered as the baseline approach [2]. The process of binary classification leads to two error types: the False Acceptance Ratio (FAR) and the False Rejection Ratio (FRR). A standalone spoof detection scheme may falsely reject a genuine sample, assuming it to be spoofed, or falsely accept an impostor sample, assuming it to be genuine. Based on these errors, the Detection Error Tradeoff (DET) curve is used to measure the performance of the features. The operating point on the DET curve at which FAR and FRR are equal is the EER, which is another metric for evaluating spoof detection performance [2]. Lastly, the normalized tandem Detection Cost Function (t-DCF) [2] is also used to measure performance, as it does not require pre-setting of a decision threshold, and is given in Equation 9:

$$\text{t-DCF}_{norm} = p_{FR} + a\, p_{FA} \tag{9}$$

where $p_{FR}$ is the probability of scores that are less than the set threshold and are considered rejected, while $p_{FA}$ is the probability of scores that are greater than the set threshold (a) and are considered accepted test samples. The distributions of the ASV and countermeasure (CM) scores, the t-DCF and DET curves, and the CM results in terms of EER and t-DCF are presented in Fig. 5 to Fig. 7 and Table III. Fig. 5 depicts the probability density function (pdf) of the ASV and CM scores. The CM scores are shown for the Baseline (red) and the Proposed model (blue). Both models are bimodal, with the two classes forming opposite distributions, except that the density of the Proposed model has a smaller peak than the more definite peaks of the Baseline model. Fig. 6 and Fig. 7 show the t-DCF and DET curves, respectively. The t-DCF is lower for the proposed technique than for the baseline model. The DET curve likewise shows a slightly lower EER for the proposed technique than for the baseline method (see Table III).
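To make the error metrics concrete, the sketch below estimates FAR, FRR and the EER from two lists of countermeasure scores by sweeping the threshold over the observed score values. It is a simplified illustration rather than the official ASVspoof evaluation code: the function names and the toy scores are hypothetical, higher scores are assumed to indicate genuine speech, and the EER is taken at the threshold where FAR and FRR are closest.

```python
def far_frr(genuine_scores, spoof_scores, threshold):
    """FAR: spoofed trials accepted; FRR: genuine trials rejected."""
    far = sum(s >= threshold for s in spoof_scores) / len(spoof_scores)
    frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    return far, frr


def equal_error_rate(genuine_scores, spoof_scores):
    """Return (EER, threshold) at the point where |FAR - FRR| is smallest."""
    best = None
    for thr in sorted(set(genuine_scores) | set(spoof_scores)):
        far, frr = far_frr(genuine_scores, spoof_scores, thr)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, thr)
    return best[1], best[2]


# Hypothetical log-likelihood-ratio scores for a handful of trials.
genuine = [2.0, 1.5, 0.5, 3.0]
spoof = [-1.0, 0.7, -2.0, -0.5]
eer, eer_threshold = equal_error_rate(genuine, spoof)
print(f"EER = {eer:.2%} at threshold {eer_threshold}")
```

In this toy set one spoofed trial scores above one genuine trial, so no threshold separates the classes perfectly and the EER is non-zero.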

VI. DISCUSSION
The GFP-based features are unique and little explored in the spoof detection domain. The glottal flow plots for synthetic speech highlight significant differences in amplitude, time and frequency information relative to genuine speech. This confirms the importance of the proposed GFPs, in addition to VT parameters, in developing countermeasures. The selection of the right GFPs is also crucial, so we used box plots to investigate which parameters are more reliable than others. For instance, the AQ captures the glottal peaks accurately and, owing to the synthetic nature of spoofed speech, its amplitude information is found to deviate from that of genuine speech. On the other hand, the HRF represents the quality of speech, which may be perceptually similar for both classes. Hence, distinguishing spoofed from genuine speech is somewhat difficult with HRF and similar parameters. Nevertheless, the GFPs as a whole, when used in conjunction with VT parameters, improve the EER and t-DCF compared with the baseline technique. This might be because the previously missing glottal flow information is now supplied by the 31 QCP glottal features, which represent amplitude along with time-frequency content, and because high-pitched voices are now more easily detected with the proposed GF features, leading to better results.

VII. CONCLUSION
The main role of a countermeasure is to prevent unauthorized access. To do so, the kind of attack and the spoofed speech must be analysed. Hence, in this research, we focused on the synthetic speech attack, using QCP estimation to extract GFP from both genuine and spoofed speech. Since the GFP represent the source of the attack samples, the minute differences between genuine and spoofed speech were magnified by the GFP. As a result, the GFP added information content to the feature set, reducing the EER from 2.70% for the Baseline LFCC to 2.39%. Thus, the FAR and FRR can be reduced by extracting relevant information from spoofed speech. Additionally, the GMM classifier captured the non-linearities quite well, as the joint contribution of GFP and LFCC provided sufficient data for better classification accuracy. This research can further be extended to replayed speech, where noise-based artefacts may be present, since GFPs are found to perform well in noisy speech. Despite the improvements obtained by employing the QCP-based GF parameters, these features have two prime limitations. First, QCP-based GIF requires precise estimation of the Glottal Closure Instants (GCI); this can be explored in the future by investigating more appropriate GCI estimation techniques. Second, unstable filter parameters contribute to computational complexity while extracting these features. In future work, prosodic features may be explored in conjunction with source-filter parameters to further reduce the EER and improve countermeasure performance.