
knowledge, corresponding to a hidden layer of a teacher model, is transferred to a student model to help the student capture robust discriminative information from short utterances. To understand the effect of the KDRL method, we plotted the distributions of the internal representations of the baseline and the KDRL method using t-Distributed Stochastic Neighbor Embedding (TSNE) [9]. Fig. 3(a) is obtained from the deep convolutional neural network (DCNN) baseline model trained with 1-second utterances. Fig. 3(b) is obtained from the KDRL-based student model that was also trained with 1-second utterances; during the student model training, a 4-second utterance-based teacher model was utilized. This figure shows that, by reducing the internal representation difference between short utterances and their corresponding long utterances, the student model attains higher inter-class variation and lower intra-class variation than the baseline model.

[Fig. 3: Representation distributions based on TSNE of the selected hidden layer on an NICT 10-language dataset]

We further investigated the KDRL method on the widely used language embedding technique, i.e., the x-vector framework [6], and proposed an x-vector extraction approach that adds a compensation constraint only to the mean component in the x-vector space. In the proposed vector, the mean component is expected to represent high-level abstract language information, while the variance component is retained to encode frame-based local phonetic information for short utterances [10].

Another work focused on reducing the difficulty of optimizing the student model against a fixed pre-trained teacher model: the inputs of the student model are short utterances, while the inputs of the teacher model are the corresponding longer utterances. This difference makes the student model prone to getting stuck in a local minimum with poor performance. In that work, rather than using a fixed pre-trained teacher model, we investigated an interactive teacher-student learning method that improves the optimization by adjusting the teacher model with reference to the performance of the student model [11].

2.2 Robustness for the cross-domain/channel problem

Recent deep neural network-based LID technologies significantly improve the accuracy of LID by using a large amount of training data and complex network structures with powerful acoustic feature extraction and abstraction capabilities. However, in real applications, such techniques often suffer from overfitting because the recording conditions and speaking styles of a test dataset differ from those of the training dataset, i.e., the cross-domain problem.

To reduce the domain discrepancy, we proposed an optimal transport (OT)-based unsupervised neural adaptation framework for cross-domain LID tasks [12]. OT was originally formulated to find an optimal transport plan that converts one probability distribution into another with the least effort [13], [14]. In our work, we adopted the OT distance metric to measure the adaptation loss between source and target data samples. Let p_s and p_t be the data distributions of the source and target domains, respectively. Then, the proposed adaptation method can be described as minimizing the OT distance between p_s and p_t.
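To make the KDRL objective above concrete, the following PyTorch sketch combines a standard classification loss on short utterances with a penalty on the gap between the student's and the teacher's hidden representations. The function name, the weight alpha, and the choice of MSE as the representation distance are illustrative assumptions, not the exact formulation of the published method.

```python
import torch
import torch.nn.functional as F

def kdrl_loss(student_hidden, teacher_hidden, student_logits, labels, alpha=0.5):
    """Hypothetical KDRL objective (sketch): language classification loss
    plus a penalty on the hidden-representation gap to the teacher."""
    # Standard cross-entropy on the language labels of the short utterances.
    ce = F.cross_entropy(student_logits, labels)
    # Match the student's hidden layer (short utterance) to the teacher's
    # hidden layer (corresponding long utterance); the teacher is frozen here.
    kd = F.mse_loss(student_hidden, teacher_hidden.detach())
    return ce + alpha * kd
```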
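A visualization of the kind shown in Fig. 3 can be reproduced along the following lines with scikit-learn; the perplexity value, plotting details, and variable names are assumptions rather than the settings actually used for the figure.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(hidden_feats: np.ndarray, lang_ids: np.ndarray) -> None:
    """Project hidden-layer activations (num_utterances x dim) to 2-D with
    TSNE and color each point by its language label (assumed inputs)."""
    emb = TSNE(n_components=2, perplexity=30).fit_transform(hidden_feats)
    plt.scatter(emb[:, 0], emb[:, 1], c=lang_ids, s=5, cmap="tab10")
    plt.title("TSNE of hidden-layer representations")
    plt.show()
```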
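A minimal sketch of the mean-only compensation idea in the x-vector framework is given below: statistics pooling yields a per-utterance mean and standard deviation, and only the mean of the short utterance is constrained toward that of its long counterpart, leaving the variance component free to keep local phonetic detail. The names and the weight beta are assumed for illustration.

```python
import torch
import torch.nn.functional as F

def stats_pooling(frame_feats):
    """Standard x-vector statistics pooling over time (batch, time, dim):
    per-utterance mean and standard deviation."""
    return frame_feats.mean(dim=1), frame_feats.std(dim=1)

def mean_compensation_loss(short_frames, long_frames, beta=1.0):
    """Hypothetical mean-only compensation (sketch): constrain only the
    mean statistics of the short utterance toward those of the long one;
    the std component is deliberately left unconstrained."""
    mean_s, _ = stats_pooling(short_frames)
    mean_l, _ = stats_pooling(long_frames)
    return beta * F.mse_loss(mean_s, mean_l.detach())
```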
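The interactive teacher-student idea can be sketched as a joint update in which the representation-matching term also back-propagates into the teacher, so the teacher adjusts with reference to how well the student can follow it. This is only one plausible instantiation of [11], under assumed model interfaces (each model returns a hidden representation and logits, and one optimizer covers both parameter sets).

```python
import torch
import torch.nn.functional as F

def interactive_ts_step(teacher, student, opt, long_x, short_x, labels):
    """One interactive teacher-student step (sketch): both models are
    optimized jointly instead of freezing a pre-trained teacher."""
    t_hidden, t_logits = teacher(long_x)    # long utterances
    s_hidden, s_logits = student(short_x)   # matching short segments
    # No detach: the gap term also moves the teacher's representation
    # toward something the short-utterance student can match.
    gap = F.mse_loss(s_hidden, t_hidden)
    loss = (F.cross_entropy(t_logits, labels)
            + F.cross_entropy(s_logits, labels)
            + gap)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```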
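As a sketch of an OT-based adaptation loss, the entropically regularized (Sinkhorn) OT distance between a batch of source embeddings and a batch of target embeddings can be computed as below, with uniform marginals and a squared-Euclidean cost. This is a generic implementation for illustration; the exact OT formulation and cost used in [12] may differ.

```python
import torch

def sinkhorn_ot(xs, xt, eps=0.1, n_iters=50):
    """Entropic OT distance between xs (n, d) source embeddings and
    xt (m, d) target embeddings, via standard Sinkhorn iterations."""
    n, m = xs.size(0), xt.size(0)
    cost = torch.cdist(xs, xt, p=2) ** 2    # pairwise squared distances
    cost = cost / cost.max()                # normalize for numerical stability
    K = torch.exp(-cost / eps)              # Gibbs kernel
    a = torch.full((n,), 1.0 / n, device=xs.device)  # uniform source marginal
    b = torch.full((m,), 1.0 / m, device=xs.device)  # uniform target marginal
    u = torch.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.t() @ u)
        u = a / (K @ v)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)  # transport plan diag(u) K diag(v)
    return (plan * cost).sum()                  # OT cost <C, P>
```

Because the computation is differentiable in xs and xt, this quantity can serve directly as an adaptation loss that pulls the target embedding distribution toward the source one during training.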
