
speakers, respectively. LLR denotes the log-likelihood ratio score calculated as in Eq. 4, JB_net the joint Bayesian model network (the generative model), and LDA_net a linear discriminant analysis network used for dimensionality reduction. Dense layers were used to fit the functions of the model parameters used in the JB transform (coupling to the generative model) with the associated parameter sets (refer to [22] for details). The two input feature vectors represent the two compared utterances. For discriminative training, we further proposed an objective function based on the false-alarm and miss metrics used in detection tasks. The idea is illustrated in Fig. 13: given a decision threshold θ, the tradeoff between the miss rate and the false-alarm rate can be estimated during optimization. Based on the proposed framework, the two model distributions (for the same-speaker and different-speaker hypotheses) were further separated, as shown in Fig. 14, and the speaker verification experiments confirmed the improved performance.

Summary

In this paper, we gave an overview of our recent work on both spoken language identification and speaker recognition tasks. Our research focused on improving the basic theoretical methods and on filling the gap between research and development, so as to further promote the application of multilingual speech technology. Recently, researchers have made great progress on LID and SV/SI techniques. However, some challenges still need to be overcome before these techniques can be applied well in real environments:

1. LID and SV/SI techniques often work as the pre-processing step of a speech processing system; therefore, the real-time factor (RTF) and latency are important factors.
Recent state-of-the-art techniques are based on large self-supervised models, for example, wav2vec [23]; therefore, developing high-performance models with low RTF and latency is necessary, especially when the techniques run on mobile devices.

2. The cross-domain/channel problem is still one of the most challenging problems for deep learning techniques. LID tasks are more sensitive to the cross-channel problem, and an LID model may even extract channel features rather than language features in recognition tasks.

In future work, we will continue to focus on improving the robustness of LID and SV/SI by using self-supervised learning and pre-training techniques.

References

[1] H. Li, B. Ma, and K. A. Lee, "Spoken language recognition: From fundamentals to practice," Proc. IEEE, vol. 101, no. 5, pp. 1136–1159, 2013.
[2] P. Shen, X. Lu, L. Liu, and H. Kawai, "Local Fisher discriminant analysis for spoken language identification," Proc. ICASSP, 2016.
[3] X. Lu, P. Shen, Y. Tsao, and H. Kawai, "Regularization of neural network model with distance metric learning for i-vector based spoken language identification," Computer Speech & Language, vol. 44, pp. 48–60, 2017.
[4] A. Lozano-Diez, R. Zazo Candil, J. G. Dominguez, D. T. Toledano, and J. G. Rodriguez, "An end-to-end approach to language identification in short utterances using convolutional neural networks," Proc. of INTER-

Fig. 12: The proposed two-branch Siamese neural network with coupling of the generative joint Bayesian model structure.
Fig. 13: The LLR distributions for the same-speaker and different-speaker conditions, with consideration of miss and false alarm (on the testing set).
Fig. 14: LLR distributions for the test data set before (a) and after (b) the coupled training (same-speaker hypothesis vs. different-speaker hypothesis).

2-5 Language identification and speaker identification technology
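The miss/false-alarm objective above hinges on how a single decision threshold splits the two LLR distributions. A minimal sketch of that tradeoff on synthetic scores (the Gaussian score distributions and the threshold values are illustrative assumptions, not the paper's model outputs):

```python
import numpy as np

# Hypothetical LLR scores for same-speaker (target) and different-speaker
# (non-target) trials; in the paper these come from the Siamese JB network,
# here they are simply simulated.
rng = np.random.default_rng(0)
target_llr = rng.normal(loc=2.0, scale=1.0, size=1000)
nontarget_llr = rng.normal(loc=-2.0, scale=1.0, size=1000)

def miss_and_false_alarm(target, nontarget, theta):
    """Miss rate: target trials rejected (LLR < theta).
    False-alarm rate: non-target trials accepted (LLR >= theta)."""
    p_miss = float(np.mean(target < theta))
    p_fa = float(np.mean(nontarget >= theta))
    return p_miss, p_fa

# Sweeping the decision threshold traces the miss/false-alarm tradeoff:
# a higher threshold misses more targets but accepts fewer impostors.
for theta in (-1.0, 0.0, 1.0):
    p_miss, p_fa = miss_and_false_alarm(target_llr, nontarget_llr, theta)
    print(f"theta={theta:+.1f}  miss={p_miss:.3f}  false_alarm={p_fa:.3f}")
```

A differentiable surrogate of these two rates is what a detection-oriented training objective would optimize; the hard counts here are only for evaluation.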
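On the real-time-factor point in challenge 1: RTF is wall-clock processing time divided by the duration of the processed audio, so RTF < 1 means faster than real time. A sketch of how it can be measured (the toy `process` function and the 16 kHz sample count are assumptions; a real system would run LID/SV model inference in its place):

```python
import time

def real_time_factor(process, audio, audio_duration_s):
    """RTF = wall-clock processing time / audio duration in seconds."""
    start = time.perf_counter()
    process(audio)
    elapsed = time.perf_counter() - start
    return elapsed / audio_duration_s

# Toy stand-in for a recognition front end: 'audio' is a plain list of
# samples and processing just sums them; any RTF well below 1 here only
# reflects the trivial workload.
samples = [0.0] * 16000  # 1 second of audio at an assumed 16 kHz rate
rtf = real_time_factor(sum, samples, 1.0)
print(f"RTF = {rtf:.6f}")
```

Latency (time from end of input to decision) must be tracked separately, since a low RTF batch system can still respond late.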
