
Speaker Recognition

Speaker information is useful for many real speech applications, for example, multi-speaker meetings, interviews, or dialogs, as well as speaker authentication for security access [16]. In most applications, besides recognizing the content of speech with ASR techniques, the speaker's identity should also be recognized with automatic speaker recognition techniques. There are two basic tasks in speaker recognition: one is speaker verification (SV), and the other is speaker identification (SI). The conventional pipeline for constructing such a speaker recognition system is composed of front-end speaker feature extraction and back-end speaker classifier modeling. Front-end feature extraction tries to extract robust and discriminative features to represent speakers, and the back-end classifier models speakers with the extracted features for either classification or verification. Obviously, how the speaker representation is extracted is essential for achieving robust performance.

3.1 Deep speaker embedding for speaker recognition

One of the most representative front-end speaker features is the i-vector [17]. In i-vector extraction, speech utterances with variable durations are converted to fixed-dimension vectors with the help of Gaussian mixture models (GMMs) on probability distributions of acoustic features. Following the success of deep learning techniques in speech and image processing, several alternative speaker features have been proposed, e.g., the d-vector [18] and the x-vector [19]. In particular, the x-vector is widely used as a speaker-embedding representation in most state-of-the-art frameworks [19]. The basic model architecture for speaker embedding is shown in Fig. 7. In this figure, different types of neural network architectures could be applied, e.g., densely connected feedforward networks (FFN), convolutional neural networks (CNN), time delay neural networks (TDNN), etc.
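The key structural idea in the x-vector architecture is a pooling layer that aggregates frame-level features over time, so that utterances of any duration map to a fixed-size embedding. A minimal numpy sketch of this idea follows; the layer sizes, weights, and the single frame-level layer are illustrative assumptions, not the actual architecture of [19], which stacks several TDNN layers before statistics pooling.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def frame_layer(frames, w, b):
    """One frame-level layer applied independently to every frame."""
    return relu(frames @ w + b)

def stats_pooling(frames):
    """Aggregate variable-length frame features into one fixed vector
    by concatenating the mean and standard deviation over time."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

def embed(utterance, w1, b1, w2, b2):
    """Map an utterance of shape (T, feat_dim) to a fixed-size embedding."""
    h = frame_layer(utterance, w1, b1)   # frame-level representation (T, hidden)
    pooled = stats_pooling(h)            # (2 * hidden,) regardless of T
    return relu(pooled @ w2 + b2)        # segment-level (embedding) layer

# Toy dimensions and random weights, purely for illustration.
feat_dim, hidden, emb_dim = 24, 32, 16
w1 = rng.normal(scale=0.1, size=(feat_dim, hidden)); b1 = np.zeros(hidden)
w2 = rng.normal(scale=0.1, size=(2 * hidden, emb_dim)); b2 = np.zeros(emb_dim)

# Two utterances of different durations map to embeddings of the same size.
e_short = embed(rng.normal(size=(80, feat_dim)), w1, b1, w2, b2)
e_long = embed(rng.normal(size=(500, feat_dim)), w1, b1, w2, b2)
print(e_short.shape, e_long.shape)  # (16,) (16,)
```

In the real system, the network is trained to classify speaker identities, and after training the classification head is discarded: the output of the layer just after pooling is kept as the x-vector.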
The model is trained for speaker recognition with speaker identities as target labels, and the pooling layer output is used as the feature representation. As illustrated in Fig. 7, the advantage of the x-vector representation is that the model for x-vector extraction can be efficiently trained with a large number of speech samples from various speakers. Moreover, in order to capture robust speaker information, data augmentation with various noise types and signal-to-noise ratios (SNRs) can easily be applied during model training [19]. The extracted speaker embedding features show excellent clustering properties. An example is shown in Fig. 8 (samples for 50 speakers are shown). This cluster distribution is a feature projection based on t-SNE [9]. From this figure, we can see that speech samples are well separated according to their speaker identities. We believe that tasks relying on speaker information could obtain good performance with this representation.

3.2 Hybrid generative and discriminative back-end modeling on speaker embedding

Based on a speaker embedding (e.g., the x-vector), various tasks can be constructed with different task-specific back-ends, for example, probabilistic linear discriminant analysis (PLDA) [20] or joint Bayesian (JB) [21] modeling on the speaker embedding feature, as shown in Fig. 9.

Fig. 6 The proposed RNN-transducer-based language embedding
Fig. 7 Deep speaker embedding (x-vector) for speaker feature extraction

2-5 Language Identification and Speaker Identification Technology
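PLDA and joint Bayesian back-ends are generative models over embeddings and are too involved for a short sketch, but the simplest verification back-end, cosine scoring, illustrates how an embedding-based trial is decided: score the enrollment and test embeddings, then compare against a threshold. The toy embeddings and the threshold value below are illustrative assumptions, not parameters from the text.

```python
import numpy as np

def cosine_score(x, y):
    """Cosine similarity between two speaker embeddings."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def verify(enroll, test, threshold=0.5):
    """Accept the verification trial if the score exceeds the threshold."""
    return cosine_score(enroll, test) >= threshold

rng = np.random.default_rng(1)
# Toy embeddings: a same-speaker pair is a noisy copy of one vector,
# while a different-speaker pair uses an independent random vector.
spk_a = rng.normal(size=128)
same = spk_a + 0.1 * rng.normal(size=128)
other = rng.normal(size=128)

print(verify(spk_a, same), verify(spk_a, other))  # True False
```

A PLDA or JB back-end replaces the raw cosine score with a log-likelihood ratio between the same-speaker and different-speaker hypotheses, which handles channel and session variability better than this plain similarity.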

