In this figure, the speech input is divided into several segments of a certain duration for speaker-embedding feature extraction. The extracted speaker embedding feature is then modeled with a probabilistic model. For the SV task, we need to estimate a log-likelihood ratio (LLR) for a hypothesis test as

\mathrm{LLR}(\mathbf{x}_1, \mathbf{x}_2) = \log \frac{p(\mathbf{x}_1, \mathbf{x}_2 \mid H_s)}{p(\mathbf{x}_1, \mathbf{x}_2 \mid H_d)}    (4)

where \mathbf{x}_1 and \mathbf{x}_2 are the two compared feature vectors, and H_s and H_d are the hypotheses that they come from the same speaker or from different speakers, respectively. The generative probabilistic model is robust to various noise conditions and unknown speakers but lacks discriminative power. In our study, we proposed to integrate the generative model with a discriminative learning framework to improve performance [22]. Because the generative model focuses on class-conditional feature distributions while the discriminative model focuses on classification boundaries, the generative model generalizes well to short utterances (but has less discriminative power), whereas the discriminative model has high discriminative capacity (but generalizes less well to short utterances). Fig. 10 illustrates the two different focuses of the two types of models; in this figure, only two classes are shown. By coupling the generative model into a discriminative neural network learning framework, we can combine the advantages of both generative and discriminative models to constrain large model variation.

Correspondingly, the probabilistic graphic network can be represented as in Fig. 11. In this figure, x denotes a feature variable and y denotes a speaker ID label. In the generative model, the probability p(x|y) measures the likelihood of generating an acoustic observation x given a speaker ID label y. In the discriminative model, the probability p(y|x) represents the posterior probability of a speaker ID label y given an acoustic observation x. Fig. 11 also shows the model parameter sets of the generative and discriminative models. In most studies, these two parameter sets are estimated with different methods. In our study, we proposed to couple the generative model with a discriminative learning framework for model parameter estimation. The proposed model framework is shown in Fig. 12, where it was adopted for the SV task with the two hypothesis labels H_s and H_d, i.e., the two compared utterances are from the same speaker or from different speakers.

Fig. 8 Speaker clustering based on speaker embedding features
Fig. 9 Speaker embedding and backend modeling based on generative probabilistic models
Fig. 10 Generative model (focuses on class-conditional feature distributions, indicated by dashed circles) vs. discriminative model (focuses on the discriminative class boundary, represented as a solid curve). C1 and C2: class 1 and class 2, respectively
Fig. 11 Probabilistic graphic model for the generative (left) and discriminative (right) models
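To make Eq. (4) concrete, the sketch below scores a pair of speaker embeddings under a simplified two-covariance Gaussian backend, one common PLDA-style instantiation of the generative backend discussed above. The function name llr_score, the parameter names (mu, B, W), and the toy diagonal covariances are illustrative assumptions, not the exact backend used in this work.

```python
# A minimal sketch of Eq. (4): pairwise LLR scoring of two speaker embeddings
# under an assumed two-covariance Gaussian (PLDA-style) backend.
import numpy as np

def llr_score(x1, x2, mu, B, W):
    """log p(x1, x2 | H_s) - log p(x1, x2 | H_d).

    mu : global mean of the embeddings
    B  : between-speaker covariance
    W  : within-speaker covariance
    """
    d = len(mu)
    # Joint Gaussian over the stacked pair [x1; x2].
    # Same-speaker hypothesis H_s: both embeddings share one latent speaker
    # variable, so the off-diagonal blocks equal the between-speaker covariance.
    cov_s = np.block([[B + W, B], [B, B + W]])
    # Different-speaker hypothesis H_d: the two embeddings are independent.
    cov_d = np.block([[B + W, np.zeros((d, d))], [np.zeros((d, d)), B + W]])
    z = np.concatenate([x1 - mu, x2 - mu])

    def log_gauss(z, cov):
        # Log-density of a zero-mean multivariate Gaussian at z.
        _, logdet = np.linalg.slogdet(cov)
        return -0.5 * (len(z) * np.log(2 * np.pi) + logdet
                       + z @ np.linalg.solve(cov, z))

    return log_gauss(z, cov_s) - log_gauss(z, cov_d)

# Toy usage with random 16-dimensional embeddings and diagonal covariances.
rng = np.random.default_rng(0)
dim = 16
mu = np.zeros(dim)
B = 0.6 * np.eye(dim)   # between-speaker variability
W = 0.4 * np.eye(dim)   # within-speaker variability
x1, x2 = rng.standard_normal(dim), rng.standard_normal(dim)
print(llr_score(x1, x2, mu, B, W))  # higher score -> more likely same speaker
```

In practice the score is compared against a threshold to accept or reject the same-speaker hypothesis; how mu, B, and W are estimated depends on the chosen backend (e.g., JB or PLDA, as in Fig. 9).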