A paper review based on "X-Vectors: Robust DNN Embeddings for Speaker Recognition"
Abstract
- data augmentation to improve the performance of DNN embeddings
- a DNN maps variable-length utterances to fixed-dimensional embeddings, called x-vectors
- embeddings leverage large-scale training datasets (better than i-vectors)
- but it is challenging to collect large amounts of data
- so data augmentation is used: added noise and reverberation (additive-noise sketch below)
- augmentation is beneficial in the PLDA (probabilistic linear discriminant analysis) classifier
- but not helpful in the i-vector extractor
- however, on the evaluation datasets, x-vectors achieve superior performance
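A minimal sketch of the additive-noise part of such augmentation (a hypothetical helper, not the paper's actual recipe; the reverberation part, convolving speech with room impulse responses, is not shown): scale a noise signal so that the speech-to-noise power ratio matches a chosen SNR, then mix it into the utterance.

```python
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    # Tile or trim the noise so it covers the whole utterance
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale factor that yields the requested signal-to-noise ratio
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: corrupt a dummy "utterance" with noise at 10 dB SNR
speech = np.random.randn(16000)
noise = np.random.randn(8000)
augmented = add_noise(speech, noise, snr_db=10.0)
```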
Introduction
- x-vectors: representations that are extracted from a DNN and used like i-vectors
- goal: to show that augmenting the training data is an effective strategy
- i-vectors
- the standard approach consists of a UBM (universal background model)
- and a large projection matrix T (learned in an unsupervised way)
- the projection maps high-dimensional statistics (from the UBM) into a low-dimensional representation → i-vectors (total variability model sketched below)
- a PLDA classifier is used to compare i-vectors and make speaker decisions
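For reference, the standard total variability model behind i-vectors (the usual textbook formulation, not quoted from this paper) can be written as:

```latex
% Total variability model: the GMM mean supervector M of an utterance is modeled as
\[
  M = m + T\,w, \qquad w \sim \mathcal{N}(0, I)
\]
% m : speaker-independent UBM mean supervector
% T : low-rank total variability (projection) matrix, learned in an unsupervised way
% the posterior mean of w given the utterance's UBM statistics is the i-vector
```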
- DNNs are trained as acoustic models for ASR (automatic speech recognition), then used to enhance phonetic modeling in the i-vector UBM
- either the ASR DNN replaces the GMM (Gaussian mixture model) for computing frame posteriors
- or bottleneck features are extracted from the DNN and combined with acoustic features
- if the ASR DNN is trained on in-domain data, the improvement is substantial
- drawback: the need for transcribed training data
- early neural networks
- trained to separate speakers, producing frame-level representations used in Gaussian speaker models
- Heigold et al.: jointly learn an embedding together with a similarity metric to compare pairs of embeddings
- Snyder et al.: adapted the end-to-end approach to a text-independent application and inserted a temporal pooling layer to handle variable-length segments
- here, the end-to-end approach is split into two parts:
- a DNN to produce embeddings
- a separately trained classifier to compare them
- this allows the use of length normalization, PLDA scoring, and domain adaptation techniques
- DNN embedding performance scales well with the amount of training data (large datasets)
- however, recent studies have shown promising performance with publicly available speaker recognition corpora
Speaker Recognition Systems
- two i-vector baselines & the DNN x-vector system
Acoustic i-vector
- traditional i-vector system based on the GMM-UBM recipe
- features: 20 MFCCs with a 25 ms frame length, mean-normalized over a 3 s sliding window
- delta and acceleration coefficients are appended, creating 60-dimensional feature vectors (front end sketched below)
- energy-based SAD (speech activity detection) selects speech frames
- the UBM is a 2048-component full-covariance GMM
- a 600-dimensional i-vector extractor and PLDA for scoring
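A rough librosa-based approximation of this front end (the original system follows a Kaldi recipe; the 10 ms frame shift and the input file name below are assumptions):

```python
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)          # hypothetical input file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                            n_fft=int(0.025 * sr),        # 25 ms frame length
                            hop_length=int(0.010 * sr))   # assumed 10 ms frame shift
delta = librosa.feature.delta(mfcc)                       # first derivatives
delta2 = librosa.feature.delta(mfcc, order=2)             # acceleration coefficients
features = np.vstack([mfcc, delta, delta2])               # shape: (60, num_frames)
# The paper additionally applies 3 s sliding-window mean normalization and
# energy-based SAD before using these frames; both are omitted here.
```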
Phonetic bottleneck i-vector
- this i-vector system incorporates phonetic bottleneck features (BNFs) from an ASR DNN acoustic model
- the DNN is a time-delay acoustic model with p-norm nonlinearities
- its penultimate layer is replaced with a 60-dimensional linear bottleneck layer (sketched below)
- excluding the softmax output layer, the DNN has 9.2 million parameters
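A minimal PyTorch sketch of the bottleneck idea (layer sizes and the senone count are assumptions, and plain feed-forward ReLU layers stand in for the paper's time-delay layers with p-norm nonlinearities): the 60-dimensional linear layer just before the output provides the BNFs that get appended to the acoustic features for the i-vector system.

```python
import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    def __init__(self, feat_dim=40, hidden_dim=650, num_senones=5000):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.bottleneck = nn.Linear(hidden_dim, 60)   # linear layer, no nonlinearity
        self.output = nn.Linear(60, num_senones)      # softmax over ASR senone targets

    def forward(self, x):
        return self.output(self.bottleneck(self.hidden(x)))

    def extract_bnf(self, x):
        # BNFs: activations of the 60-dim bottleneck, used as extra i-vector features
        return self.bottleneck(self.hidden(x))
```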
The x-vector system
- first five layers operate on speech frames, with a small temporal context centered at the current frame t
- statistics pooling layer aggregates all T frame-level outputs from layer frame5
- and computes their mean and standard deviation
- aggregates information across the time dimension, so subsequent layers operate on the entire segment
- the mean and standard deviation are concatenated and propagated through the segment-level layers and the softmax output layer (all nonlinearities are ReLUs)
- DNN is trained to classify the N speakers in the training data
- a training example consists of a chunk of speech features (3 s on average) & the corresponding speaker label
- after training, embeddings are extracted from the affine component of layer segment6 (excluding the softmax output layer and segment7) → total of 4.2 million parameters (architecture sketched below)
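A minimal PyTorch sketch of this architecture (layer widths and the input feature dimension are assumptions, not quoted from the review): frame-level TDNN layers approximated by dilated 1-D convolutions, statistics pooling over all frames, segment-level layers, and a speaker softmax. The x-vector is read from the affine output of segment6.

```python
import torch
import torch.nn as nn

class XVectorNet(nn.Module):
    def __init__(self, feat_dim=24, num_speakers=1000, emb_dim=512):
        super().__init__()
        # Frame-level layers: dilated convolutions emulate TDNN time splicing
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        self.segment6 = nn.Linear(2 * 1500, emb_dim)   # x-vector is taken from here
        self.segment7 = nn.Linear(emb_dim, emb_dim)
        self.output = nn.Linear(emb_dim, num_speakers)

    def pool(self, h):
        # Statistics pooling: mean and standard deviation over the whole segment
        return torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)

    def forward(self, x):                 # x: (batch, feat_dim, num_frames)
        stats = self.pool(self.frame_layers(x))
        h = torch.relu(self.segment6(stats))
        h = torch.relu(self.segment7(h))
        return self.output(h)             # speaker posteriors, used only in training

    def extract_xvector(self, x):
        # Embedding: affine output of segment6, before the ReLU
        return self.segment6(self.pool(self.frame_layers(x)))
```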
PLDA classifier
- the representations (x-vectors or i-vectors) are centered and projected using LDA
- LDA dimension was tuned on the SITW development set to 200 for i-vectors and 150 for x-vectors
- after dimensionality reduction, the representations are length-normalized and modeled by PLDA
- scores are normalized using adaptive s-norm (back end sketched below)
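A back-end sketch under stated assumptions (sklearn's LDA stands in for the original implementation, the PLDA model itself is not shown, and the s-norm function follows the common adaptive s-norm formulation, which may differ in detail from the paper's setup):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def preprocess(train_emb, train_labels, test_emb, lda_dim=150):
    # Centering statistics estimated on the training embeddings
    mean = train_emb.mean(axis=0)
    lda = LinearDiscriminantAnalysis(n_components=lda_dim)
    lda.fit(train_emb - mean, train_labels)

    def transform(e):
        e = lda.transform(e - mean)                   # center + LDA projection
        # Length normalization: project each vector onto the unit sphere
        return e / np.linalg.norm(e, axis=1, keepdims=True)

    return transform(train_emb), transform(test_emb)  # ready for PLDA scoring

def s_norm(score, enroll_cohort_scores, test_cohort_scores, top_k=200):
    # Adaptive s-norm sketch: normalize a trial score with the statistics of the
    # top-k most competitive cohort scores on the enrollment and test sides
    e = np.sort(enroll_cohort_scores)[-top_k:]
    t = np.sort(test_cohort_scores)[-top_k:]
    return 0.5 * ((score - e.mean()) / e.std() + (score - t.mean()) / t.std())
```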