October 17, 2019

508 words 3 mins read



Unofficial PyTorch implementation of Google AI’s VoiceFilter system

repo name mindslab-ai/voicefilter
repo link https://github.com/mindslab-ai/voicefilter
homepage http://swpark.me/voicefilter
language Python
size (curr.) 1195 kB
stars (curr.) 445
created 2019-03-22


Unofficial PyTorch implementation of Google AI’s: VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking.


  • Training took about 20 hours on AWS p3.2xlarge(NVIDIA V100).

Audio Sample


Median SDR Paper Ours
before VoiceFilter 2.5 1.9
after VoiceFilter 12.6 10.2

  • SDR converged at 10, which is slightly lower than paper’s.


  1. Python and packages

    This code was tested on Python 3.6 with PyTorch 1.0.1. Other packages can be installed by:

    pip install -r requirements.txt
  2. Miscellaneous

    ffmpeg-normalize is used for resampling and normalizing wav files. See README.md of ffmpeg-normalize for installation.

Prepare Dataset

  1. Download LibriSpeech dataset

    To replicate VoiceFilter paper, get LibriSpeech dataset at http://www.openslr.org/12/. train-clear-100.tar.gz(6.3G) contains speech of 252 speakers, and train-clear-360.tar.gz(23G) contains 922 speakers. You may use either, but the more speakers you have in dataset, the more better VoiceFilter will be.

  2. Resample & Normalize wav files

    First, unzip tar.gz file to desired folder:

    tar -xvzf train-clear-360.tar.gz

    Next, copy utils/normalize-resample.sh to root directory of unzipped data folder. Then:

    vim normalize-resample.sh # set "N" as your CPU core number.
    chmod a+x normalize-resample.sh
    ./normalize-resample.sh # this may take long
  3. Edit config.yaml

    cd config
    cp default.yaml config.yaml
    vim config.yaml
  4. Preprocess wav files

    In order to boost training speed, perform STFT for each files before training by:

    python generator.py -c [config yaml] -d [data directory] -o [output directory] -p [processes to run]

    This will create 100,000(train) + 1000(test) data. (About 160G)

Train VoiceFilter

  1. Get pretrained model for speaker recognition system

    VoiceFilter utilizes speaker recognition system (d-vector embeddings). Here, we provide pretrained model for obtaining d-vector embeddings.

    This model was trained with VoxCeleb2 dataset, where utterances are randomly fit to time length [70, 90] frames. Tests are done with window 80 / hop 40 and have shown equal error rate about 1%. Data used for test were selected from first 8 speakers of VoxCeleb1 test dataset, where 10 utterances per each speakers are randomly selected.

    Update: Evaluation on VoxCeleb1 selected pair showed 7.4% EER.

    The model can be downloaded at this GDrive link.

  2. Run

    After specifying train_dir, test_dir at config.yaml, run:

    python trainer.py -c [config yaml] -e [path of embedder pt file] -m [name]

    This will create chkpt/name and logs/name at base directory(-b option, . in default)

  3. View tensorboardX

    tensorboard --logdir ./logs

  4. Resuming from checkpoint

    python trainer.py -c [config yaml] --checkpoint_path [chkpt/name/chkpt_{step}.pt] -e [path of embedder pt file] -m name


python inference.py -c [config yaml] -e [path of embedder pt file] --checkpoint_path [path of chkpt pt file] -m [path of mixed wav file] -r [path of reference wav file] -o [output directory]

Possible improvments

  • Try power-law compressed reconstruction error as loss function, instead of MSE. (See #14)


Seungwon Park at MINDsLab (yyyyy@snu.ac.kr, swpark@mindslab.ai)


Apache License 2.0

This repository contains codes adapted/copied from the followings:

comments powered by Disqus