Abstract

We propose a speaker-information-guided system for zero-shot voice conversion. The system is designed to obtain intermediate representations for speaker-content disentanglement of speech, so that speaker information is removed more thoroughly and purer content information is retained. Accordingly, the proposed framework contains a module that removes speaker information from the acoustic features of the source speaker. Speaker information control is also added to the system to maintain voice cloning performance. The proposed system is evaluated with both subjective and objective metrics. Results show that it significantly reduces the trade-off problem in zero-shot voice conversion while also achieving high spoofing power against a speaker verification system.

This page presents audio synthesized by our proposed system, SIG-VC, and by the baseline system, AGAIN-VC. All synthesized audio is generated using only the source utterance and the target utterance.

Same sample as on the AGAIN-VC demo page

Source: p343_004	Target: p362_010	SIG-VC	AGAIN-VC

Randomly selected samples

Source: p225_066	Target: p228_199	SIG-VC	AGAIN-VC

Source: p228_199	Target: p225_066	SIG-VC	AGAIN-VC

Source: p225_066	Target: p362_101	SIG-VC	AGAIN-VC

Source: p252_005	Target: p362_101	SIG-VC	AGAIN-VC

Source: p228_199	Target: p261_036	SIG-VC	AGAIN-VC

Source: p252_005	Target: p261_036	SIG-VC	AGAIN-VC

Source: p232_079	Target: p252_005	SIG-VC	AGAIN-VC

Source: p228_199	Target: p343_006	SIG-VC	AGAIN-VC

Source: p252_005	Target: p225_066	SIG-VC	AGAIN-VC

Source: p261_036	Target: p228_199	SIG-VC	AGAIN-VC

Source: p252_005	Target: p228_199	SIG-VC	AGAIN-VC

Source: p334_002	Target: p360_001	SIG-VC	AGAIN-VC

Source: p252_005	Target: p232_079	SIG-VC	AGAIN-VC