A paper review based on "X-Vectors: Robust DNN Embeddings for Speaker Recognition"
Abstract
- data augmentation to improve the performance of DNN embeddings
- a DNN maps variable-length utterances to fixed-dimensional embeddings, called x-vectors
- embeddings leverage large-scale training datasets (better than i-vectors)
- but it is challenging to collect large amounts of data
- so data augmentation is used: added noise and reverberation (additive-noise sketch below)
- augmentation is beneficial in the PLDA (probabilistic linear discriminant analysis) classifier
- but not helpful in the i-vector extractor
- however, on the evaluation datasets, x-vectors achieve superior performance
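A minimal sketch of the additive-noise part of such augmentation (a hypothetical helper, not the paper's actual recipe; the reverberation part, convolving speech with room impulse responses, is not shown): scale a noise signal so that the speech-to-noise power ratio matches a chosen SNR, then mix it into the utterance.

```python
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    # Tile or trim the noise so it covers the whole utterance
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale factor that yields the requested signal-to-noise ratio
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: corrupt a dummy "utterance" with noise at 10 dB SNR
speech = np.random.randn(16000)
noise = np.random.randn(8000)
augmented = add_noise(speech, noise, snr_db=10.0)
```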
Introduction
- x-vectors: representations that are extracted from a DNN and used like i-vectors
- goal: to show that augmenting the training data is an effective strategy
- i-vectors
- the standard approach consists of a UBM (universal background model)
- and a large projection matrix T (learned in an unsupervised way)
- the projection maps high-dimensional statistics (from the UBM) into a low-dimensional representation → i-vectors (total variability model sketched below)
- a PLDA classifier is used to compare i-vectors and make speaker decisions
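For reference, the standard total variability model behind i-vectors (the usual textbook formulation, not quoted from this paper) can be written as:

```latex
% Total variability model: the GMM mean supervector M of an utterance is modeled as
\[
  M = m + T\,w, \qquad w \sim \mathcal{N}(0, I)
\]
% m : speaker-independent UBM mean supervector
% T : low-rank total variability (projection) matrix, learned in an unsupervised way
% the posterior mean of w given the utterance's UBM statistics is the i-vector
```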
- DNNs are trained as acoustic models for ASR (automatic speech recognition), then used to enhance phonetic modeling in the i-vector UBM
- either the ASR DNN replaces the GMM (Gaussian mixture model) for computing frame posteriors
- or bottleneck features are extracted from the DNN and combined with acoustic features
- if the ASR DNN is trained on in-domain data, the improvement is substantial
- drawback: the need for transcribed training data
- early neural networks
- trained to separate speakers, producing frame-level representations used in Gaussian speaker models
- Heigold et al.: jointly learn an embedding together with a similarity metric to compare pairs of embeddings
- Snyder et al.: adapted the end-to-end approach to a text-independent application and inserted a temporal pooling layer to handle variable-length segments
- here, the end-to-end approach is split into two parts:
- a DNN to produce embeddings
- a separately trained classifier to compare them
- this allows the use of length normalization, PLDA scoring, and domain adaptation techniques
- DNN embedding performance scales well with the amount of training data (large datasets)
- however, recent studies have shown promising performance with publicly available speaker recognition corpora
Speaker Recognition Systems
- two i-vector baselines & the DNN x-vector system
Acoustic i-vector
- traditional i-vector system based on the GMM-UBM recipe
- features: 20 MFCCs with a 25 ms frame length, mean-normalized over a 3 s sliding window
- delta and acceleration coefficients are appended, creating 60-dimensional feature vectors (front end sketched below)
- energy-based SAD (speech activity detection) selects speech frames
- the UBM is a 2048-component full-covariance GMM
- a 600-dimensional i-vector extractor and PLDA for scoring
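A rough librosa-based approximation of this front end (the original system follows a Kaldi recipe; the 10 ms frame shift and the input file name below are assumptions):

```python
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)          # hypothetical input file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                            n_fft=int(0.025 * sr),        # 25 ms frame length
                            hop_length=int(0.010 * sr))   # assumed 10 ms frame shift
delta = librosa.feature.delta(mfcc)                       # first derivatives
delta2 = librosa.feature.delta(mfcc, order=2)             # acceleration coefficients
features = np.vstack([mfcc, delta, delta2])               # shape: (60, num_frames)
# The paper additionally applies 3 s sliding-window mean normalization and
# energy-based SAD before using these frames; both are omitted here.
```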
Phonetic bottleneck i-vector
- this i-vector system incorporates phonetic bottleneck features (BNFs) from an ASR DNN acoustic model
- the DNN is a time-delay acoustic model with p-norm nonlinearities
- its penultimate layer is replaced with a 60-dimensional linear bottleneck layer (sketched below)
- excluding the softmax output layer, the DNN has 9.2 million parameters
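A minimal PyTorch sketch of the bottleneck idea (layer sizes and the senone count are assumptions, and plain feed-forward ReLU layers stand in for the paper's time-delay layers with p-norm nonlinearities): the 60-dimensional linear layer just before the output provides the BNFs that get appended to the acoustic features for the i-vector system.

```python
import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    def __init__(self, feat_dim=40, hidden_dim=650, num_senones=5000):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.bottleneck = nn.Linear(hidden_dim, 60)   # linear layer, no nonlinearity
        self.output = nn.Linear(60, num_senones)      # softmax over ASR senone targets

    def forward(self, x):
        return self.output(self.bottleneck(self.hidden(x)))

    def extract_bnf(self, x):
        # BNFs: activations of the 60-dim bottleneck, used as extra i-vector features
        return self.bottleneck(self.hidden(x))
```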
The x-vector system
- first five layers operate on speech frames, with a small temporal context centered at the current frame t
- statistics pooling layer aggregates all T frame-level outputs from layer frame5
- and computes their mean and standard deviation
- aggregates information across the time dimension, so subsequent layers operate on the entire segment
- the mean and standard deviation are concatenated and propagated through the segment-level layers and the softmax output layer (all nonlinearities are ReLUs)
- DNN is trained to classify the N speakers in the training data
- a training example consists of a chunk of speech features (3 s on average) & the corresponding speaker label
- after training, embeddings are extracted from the affine component of layer segment6 (excluding the softmax output layer and segment7) → total of 4.2 million parameters (architecture sketched below)
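A minimal PyTorch sketch of this architecture (layer widths and the input feature dimension are assumptions, not quoted from the review): frame-level TDNN layers approximated by dilated 1-D convolutions, statistics pooling over all frames, segment-level layers, and a speaker softmax. The x-vector is read from the affine output of segment6.

```python
import torch
import torch.nn as nn

class XVectorNet(nn.Module):
    def __init__(self, feat_dim=24, num_speakers=1000, emb_dim=512):
        super().__init__()
        # Frame-level layers: dilated convolutions emulate TDNN time splicing
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        self.segment6 = nn.Linear(2 * 1500, emb_dim)   # x-vector is taken from here
        self.segment7 = nn.Linear(emb_dim, emb_dim)
        self.output = nn.Linear(emb_dim, num_speakers)

    def pool(self, h):
        # Statistics pooling: mean and standard deviation over the whole segment
        return torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)

    def forward(self, x):                 # x: (batch, feat_dim, num_frames)
        stats = self.pool(self.frame_layers(x))
        h = torch.relu(self.segment6(stats))
        h = torch.relu(self.segment7(h))
        return self.output(h)             # speaker posteriors, used only in training

    def extract_xvector(self, x):
        # Embedding: affine output of segment6, before the ReLU
        return self.segment6(self.pool(self.frame_layers(x)))
```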
PLDA classifier
- the representations (x-vectors or i-vectors) are centered and projected using LDA
- LDA dimension was tuned on the SITW development set to 200 for i-vectors and 150 for x-vectors
- after dimensionality reduction, the representations are length-normalized and modeled by PLDA
- scores are normalized using adaptive s-norm (back end sketched below)
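A back-end sketch under stated assumptions (sklearn's LDA stands in for the original implementation, the PLDA model itself is not shown, and the s-norm function follows the common adaptive s-norm formulation, which may differ in detail from the paper's setup):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def preprocess(train_emb, train_labels, test_emb, lda_dim=150):
    # Centering statistics estimated on the training embeddings
    mean = train_emb.mean(axis=0)
    lda = LinearDiscriminantAnalysis(n_components=lda_dim)
    lda.fit(train_emb - mean, train_labels)

    def transform(e):
        e = lda.transform(e - mean)                   # center + LDA projection
        # Length normalization: project each vector onto the unit sphere
        return e / np.linalg.norm(e, axis=1, keepdims=True)

    return transform(train_emb), transform(test_emb)  # ready for PLDA scoring

def s_norm(score, enroll_cohort_scores, test_cohort_scores, top_k=200):
    # Adaptive s-norm sketch: normalize a trial score with the statistics of the
    # top-k most competitive cohort scores on the enrollment and test sides
    e = np.sort(enroll_cohort_scores)[-top_k:]
    t = np.sort(test_cohort_scores)[-top_k:]
    return 0.5 * ((score - e.mean()) / e.std() + (score - t.mean()) / t.std())
```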