Abstract
We propose a speaker information guided system for zero-shot voice conversion. The system is designed to obtain intermediate representations that disentangle speaker and content information in speech, so that speaker information is removed more thoroughly and the remaining representation carries only content. Accordingly, the proposed framework contains a module that removes speaker information from the acoustic features of the source speaker. In addition, speaker information control is added to the system to maintain voice cloning performance. The proposed system is evaluated with both subjective and objective metrics. Results show that it significantly reduces the trade-off problem in zero-shot voice conversion while also achieving high spoofing power against a speaker verification system.
This page presents audio synthesized by our proposed system, SIG-VC, and by the baseline system, AGAIN-VC. All synthesized samples are generated using only the source utterance and the target utterance.