A paper review based on "X-Vectors: Robust DNN Embeddings for Speaker Recognition"

Abstract

  • data augmentation is used to improve the performance of DNN speaker embeddings
  • a DNN maps variable-length utterances to fixed-dimensional embeddings, called x-vectors
    • the embeddings can leverage large-scale training datasets (better than i-vectors)
    • but collecting enough labeled data is challenging
    • so data augmentation is used - adding noise and reverberation (see the sketch after this list)
  • augmentation is beneficial in the PLDA (probabilistic linear discriminant analysis) classifier
    • but not helpful in the i-vector extractor
    • on the evaluation datasets, however, x-vectors achieve superior performance
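
The paper's augmentation adds noise and reverberation to the original recordings. As a rough illustration of the additive-noise part only, here is a minimal numpy sketch; the `mix_at_snr` helper and its SNR handling are my own simplification, not the paper's recipe.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Additively mix noise into speech at a target SNR (dB).

    Both signals are 1-D float arrays at the same sample rate; the noise is
    tiled or truncated to cover the whole utterance before scaling.
    """
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12

    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```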

Introduction

  • x-vectors: representations extracted from a DNN and used like i-vectors
  • the goal is to show that augmenting the training data is an effective strategy
  • i-vectors
    • the standard approach consists of a UBM (universal background model)
    • and a large projection matrix T (learned in an unsupervised way)
    • the projection maps high-dimensional statistics (collected with the UBM) into a low-dimensional representation → the i-vector (see the model equation at the end of this section)
    • a PLDA classifier is used to compare i-vectors and make speaker decisions
  • DNNs are trained as acoustic models for ASR (automatic speech recognition) and then used to enhance phonetic modeling in the i-vector UBM
    • either the ASR DNN replaces the GMM (Gaussian mixture model) for computing frame posteriors
    • or bottleneck features are extracted from the DNN and combined with acoustic features
  • if the ASR DNN is trained on in-domain data, the improvement is substantial
    • drawback: the need for transcribed training data
  • early neural networks
    • trained to separate speakers and produce frame-level representations used in Gaussian speaker models
    • Heigold et al.: jointly learn an embedding and a similarity metric for comparing pairs of embeddings
    • Snyder et al.: end-to-end approach adapted to a text-independent application, with a pooling layer inserted to handle variable-length segments
    • this paper splits the end-to-end approach into two parts:
      • a DNN to produce embeddings
      • a separately trained classifier to compare them
      • this enables the use of length normalization, PLDA scoring, and domain adaptation techniques
  • DNN embedding performance scales well with the amount of training data (large datasets)
    • however, recent studies have shown promising performance with publicly available speaker recognition corpora
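
For reference, the i-vector model mentioned above (standard in the i-vector literature, not restated in this review) represents an utterance's GMM mean supervector as

$$ M = m + T w $$

where $m$ is the UBM mean supervector, $T$ is the low-rank projection (total-variability) matrix learned without supervision, and $w$ is the low-dimensional i-vector inferred for the utterance.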

Speaker Recognition Systems

  • two i-vector baselines & the DNN x-vector system

Acoustic i-vector

  • traditional i-vector system based on the GMM-UBM recipe
  • features: 20 MFCCs with a 25 ms frame length, mean-normalized over a 3-second sliding window
  • delta and acceleration coefficients are appended, creating 60-dimensional feature vectors (see the front-end sketch after this list)
  • energy-based SAD (speech activity detection) filters out nonspeech frames
  • the UBM is a 2048-component full-covariance GMM
  • 600-dimensional i-vector extractor and PLDA for scoring
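
A minimal sketch of this front end, using librosa as a stand-in for the paper's Kaldi feature pipeline. The 10 ms frame shift, the exact normalization-window handling, and the SAD threshold below are my assumptions; the i-vector extractor and PLDA themselves are not shown.

```python
import numpy as np
import librosa

def acoustic_ivector_features(wav, sr=16000):
    """20 MFCCs (25 ms frames), sliding-window mean normalization (~3 s),
    deltas + accelerations (60 dims), and a crude energy-based SAD."""
    hop, win = int(0.010 * sr), int(0.025 * sr)
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=20,
                                n_fft=win, hop_length=hop).T      # (frames, 20)

    # Sliding-window cepstral mean normalization over roughly 3 seconds.
    half = 150                                                    # ~1.5 s of 10 ms frames
    normed = np.empty_like(mfcc)
    for t in range(len(mfcc)):
        lo, hi = max(0, t - half), min(len(mfcc), t + half + 1)
        normed[t] = mfcc[t] - mfcc[lo:hi].mean(axis=0)

    # Append deltas and accelerations -> 60-dimensional feature vectors.
    d1 = librosa.feature.delta(normed, axis=0)
    d2 = librosa.feature.delta(normed, axis=0, order=2)
    feats = np.hstack([normed, d1, d2])

    # Crude energy-based SAD: keep frames with above-threshold RMS energy.
    rms = librosa.feature.rms(y=wav, frame_length=win, hop_length=hop)[0]
    n = min(len(feats), len(rms))
    keep = rms[:n] > 0.5 * rms[:n].mean()
    return feats[:n][keep]
```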

Phonetic bottleneck i-vector

  • this i-vector system incorporates phonetic bottleneck features (BNF) from an ASR DNN acoustic model
  • the DNN is a time-delay acoustic model with p-norm nonlinearities
  • the penultimate layer is replaced with a 60-dimensional linear bottleneck layer (see the sketch below)
  • excluding the softmax output layer, the DNN has 9.2 million parameters
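
A simplified stand-in showing where the BNF come from. The real model is a Kaldi TDNN with p-norm nonlinearities trained on senone targets; the plain feed-forward structure, ReLU activations, and layer sizes below are illustrative only and do not reproduce the 9.2-million-parameter count.

```python
import torch
import torch.nn as nn

class BottleneckAcousticModel(nn.Module):
    """ASR acoustic model whose penultimate layer is a 60-dim linear bottleneck."""

    def __init__(self, feat_dim=40, hidden=650, n_senones=5000):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.bottleneck = nn.Linear(hidden, 60)   # linear: no activation follows
        self.output = nn.Linear(60, n_senones)    # senone classification head

    def forward(self, frames):                    # frames: (batch, feat_dim)
        return self.output(self.bottleneck(self.hidden(frames)))

    def extract_bnf(self, frames):
        # The 60-dim bottleneck outputs are the BNF that get combined with
        # the acoustic features in the hybrid i-vector system.
        with torch.no_grad():
            return self.bottleneck(self.hidden(frames))
```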

The x-vector system

  1. first five layers operate on speech frames, with a small temporal context centered at the current frame t
  2. statistics pooling layer aggregates all T frame-level outputs from layer frame5
    • and computes their mean and standard deviation
    • aggregates information across the time dimension - so subsequent layers operate on the entire segment
  3. the mean and standard deviation are concatenated together and propagated through the segment-level layers and finally the softmax output layer (the nonlinearities are ReLUs)
  • DNN is trained to classify the N speakers in the training data
  • each training example consists of a chunk of speech features (about 3 s on average) and the corresponding speaker label
  • after training, embeddings are extracted from the affine component of layer segment6 (excluding the softmax output layer and segment7) → a total of 4.2 million parameters (see the sketch below)
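
A minimal PyTorch sketch of this architecture. The frame-level contexts and the 512/1500-dimensional layer sizes follow the commonly cited x-vector configuration, but batch normalization and the Kaldi training recipe are omitted, and `feat_dim`/`n_speakers` are placeholders.

```python
import torch
import torch.nn as nn

class XVectorNet(nn.Module):
    """TDNN frame-level layers, statistics pooling, segment-level layers."""

    def __init__(self, feat_dim=24, n_speakers=1000):
        super().__init__()
        # frame1-frame5: small temporal contexts realized as dilated 1-D convolutions.
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),  # [t-2, t+2]
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),       # {t-2, t, t+2}
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),       # {t-3, t, t+3}
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        # Segment-level layers operate on the pooled statistics.
        self.segment6 = nn.Linear(2 * 1500, 512)
        self.segment7 = nn.Linear(512, 512)
        self.output = nn.Linear(512, n_speakers)   # softmax over training speakers

    def stats_pool(self, h):
        # Aggregate all T frame-level outputs of frame5 into mean and std dev.
        return torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)

    def forward(self, feats):                      # feats: (batch, feat_dim, T)
        h = self.frame_layers(feats)
        s = self.stats_pool(h)
        return self.output(torch.relu(self.segment7(torch.relu(self.segment6(s)))))

    def embed(self, feats):
        # x-vector: affine output of segment6, before the ReLU (segment7 and
        # the softmax layer are discarded after training).
        with torch.no_grad():
            return self.segment6(self.stats_pool(self.frame_layers(feats)))
```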

PLDA classifier

  • the representations (x-vectors or i-vectors) are centered and projected using LDA
  • LDA dimension was tuned on the SITW development set to 200 for i-vectors and 150 for x-vectors
  • after dimensionality reduction, the representations are length-normalized and modeled by PLDA
    • PLDA scores are normalized using adaptive s-norm (see the sketch below)
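
A minimal sketch of the backend preprocessing and the score normalization. PLDA training and scoring are omitted; the cohort handling and the `top_k` value in the adaptive s-norm function are my assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def train_backend(train_vecs, train_labels, lda_dim=150):
    """Center the training representations and fit LDA (150 dims for x-vectors,
    200 for i-vectors, as tuned on the SITW development set)."""
    mean = train_vecs.mean(axis=0)
    lda = LinearDiscriminantAnalysis(n_components=lda_dim)
    lda.fit(train_vecs - mean, train_labels)
    return mean, lda

def preprocess(vecs, mean, lda):
    """Center, reduce dimensionality with LDA, then length-normalize
    before PLDA modeling."""
    x = lda.transform(vecs - mean)
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def adaptive_s_norm(score, enroll_cohort_scores, test_cohort_scores, top_k=200):
    """Adaptive s-norm: z-normalize a trial score against the top-k most
    competitive cohort scores on each side, then average the two halves."""
    e = np.sort(enroll_cohort_scores)[-top_k:]
    t = np.sort(test_cohort_scores)[-top_k:]
    return 0.5 * ((score - e.mean()) / (e.std() + 1e-12) +
                  (score - t.mean()) / (t.std() + 1e-12))
```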