linguistic information, emotion, accent, etc. In order to extract linguistic and paralinguistic information efficiently, several speech techniques have been developed. Fig. 1 illustrates the information encoded in acoustic speech and the related speech technologies for decoding the underlying information. As shown in this figure, in speech communication, besides the linguistic content transcribed by automatic speech recognition (ASR), several other non-linguistic patterns must also be identified by different techniques, for example, language recognition/identification, speaker recognition, and emotion recognition. Improving the detection and recognition accuracy of this information is key to the successful application of multilingual speech translation systems. In this paper, we explain our latest techniques for language and speaker recognition.

2 Spoken Language Identification

Spoken language identification (LID) is the task of determining which language is being spoken within a speech utterance [1]. Recently, LID techniques have been widely investigated and have progressed considerably. One conventional LID technique is the i-vector technique combined with conventional classifiers, such as support vector machines (SVM) and deep neural networks (DNN). We also investigated i-vector-based methods to further improve performance by using local Fisher discriminant analysis and pair-wise distance metric learning [2][3]. Because the performance of i-vector techniques degrades on short-utterance tasks, recent works use neural network-based techniques for building LID systems [4]-[6]. Although neural network-based techniques have shown their effectiveness on many LID tasks, several challenges still need to be overcome before they can be deployed in applications.
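To make the conventional i-vector back-end concrete, the following is a minimal sketch of cosine-scoring classification over fixed-length utterance embeddings, one common back-end for i-vector LID. The function names and the toy 2-dimensional vectors are illustrative assumptions, not part of the original work; real i-vectors are typically 400-600 dimensional and extracted from a universal background model.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def language_means(ivectors, labels):
    # Average the i-vectors of each language to obtain one model vector
    # per language (the simplest possible enrollment step).
    sums, counts = {}, {}
    for vec, lab in zip(ivectors, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [x / counts[lab] for x in acc] for lab, acc in sums.items()}

def identify(ivector, means):
    # Pick the language whose mean i-vector scores highest (cosine scoring).
    return max(means, key=lambda lab: cosine(ivector, means[lab]))

# Toy usage with hypothetical 2-D "i-vectors":
means = language_means([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], ["en", "en", "ja"])
print(identify([0.8, 0.2], means))  # -> en
```

In practice an SVM or DNN classifier, as mentioned above, replaces the nearest-mean rule, but cosine scoring keeps the geometry of the back-end easy to see.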
In this paper, we introduce our work on two key challenges of LID tasks: the short-utterance problem and the cross-domain/channel problem.

2.1 Knowledge distillation for short-utterance LID

LID technology is commonly used in the preprocessing stage of multilingual speech processing systems, such as spoken language translation and multilingual speech recognition. Traditional LID requires longer speech input to obtain better recognition performance; in real-time systems, however, the use of longer speech delays the entire system. Therefore, improving the performance of LID on short utterances is one of the important tasks in reducing system latency.

Compared to long utterances, the distribution of short utterances has large intra-class variation, which results in large model confusion. Reducing this variation is expected to improve performance on short-utterance LID tasks. Inspired by previous work on knowledge distillation [7], we proposed a knowledge distillation-based representation learning (KDRL) approach that transfers the representation knowledge of a long-utterance-based teacher model to a short-utterance-based student model [8].

The proposed KDRL method is illustrated in Fig. 2. Suppose $\Theta_t$ and $\Theta_s$ are the parameters of the neural networks providing the internal representations of the teacher network and the student network, respectively. The proposed KDRL is based on minimizing the following loss function:

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_R\left(x_i^t, x_i^s; \Theta_t, \Theta_s\right) \quad (1)$$

where $x_i^t$ and $x_i^s$ are input samples of the teacher and student networks, respectively, and $\mathcal{L}_R$ is a distance metric on the internal representations, defined as

$$\mathcal{L}_R\left(x^t, x^s; \Theta_t, \Theta_s\right) = \left\| u_t\left(x^t; \Theta_t\right) - u_s\left(x^s; \Theta_s\right) \right\| \quad (2)$$

where $\|\cdot\|$ is a norm function, for example the L1- or L2-norm, which is used to measure the representation distance between the teacher and student models.
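The representation-distance loss of Eqs. (1) and (2) can be sketched numerically as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes the internal representations u_t(x_i^t) and u_s(x_i^s) have already been computed at the selected hidden layers and are stacked into (N, D) arrays; in the actual method these come from convolutional teacher and student networks trained jointly.

```python
import numpy as np

def kdrl_loss(teacher_reprs, student_reprs, norm="l2"):
    """Mean representation distance between teacher and student, Eqs. (1)-(2).

    teacher_reprs, student_reprs: arrays of shape (N, D) holding the
    internal representations u_t(x_i^t; Theta_t) and u_s(x_i^s; Theta_s)
    for N paired long/short utterances.
    """
    diff = np.asarray(teacher_reprs, dtype=float) - np.asarray(student_reprs, dtype=float)
    if norm == "l1":
        dists = np.abs(diff).sum(axis=1)          # L1-norm per sample pair
    else:
        dists = np.sqrt((diff ** 2).sum(axis=1))  # L2-norm per sample pair
    return dists.mean()                           # average over the N pairs, Eq. (1)

# Toy usage: one pair of 2-D representations at distance 5 under the L2-norm.
print(kdrl_loss([[0.0, 0.0]], [[3.0, 4.0]]))  # -> 5.0
```

During training, this term would be minimized with respect to the student parameters (the teacher held fixed or trained jointly), typically alongside the usual classification loss on hard and soft labels shown in Fig. 2.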
Here, $u_t$ and $u_s$ are the teacher and student deep nested functions up to their respective selected layers, with parameter sets $\Theta_t$ and $\Theta_s$, respectively. With the proposed method, the feature representation

[Fig. 2: The proposed KDRL method for short-utterance LID tasks. The teacher network (4-s input) and student network (2-s input) each consist of ConvBlock feature-extraction layers followed by FC classification layers, trained with soft and hard labels.]

Journal of the National Institute of Information and Communications Technology Vol. 68 No. 2 (2022), 2 Multilingual Communication Technology