Abstract
An electrolarynx (EL) is a medical device that generates speech for people whose larynx has been removed. However, the speech produced by an EL sounds unnatural and is hard to understand because of its monotonous pitch and the mechanical excitation of the device. This paper proposes an end-to-end voice conversion method that enhances EL speech to make it more natural and intelligible. We adopt a speaker-independent automatic speech recognition model to extract bottleneck features, which serve as the intermediate representation for enhancement. Our system involves multiple conversion stages. First, a parallel non-autoregressive model maps the bottleneck features of EL speech to the corresponding bottleneck features of normal speech. A second conversion model then maps the normal-speech bottleneck features directly to a normal-speech Mel-spectrogram, and a MelGAN-based vocoder converts the Mel-spectrogram into a waveform. Experimental results show that the proposed method achieves state-of-the-art naturalness and intelligibility.
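The data flow through the stages described above can be sketched as a simple composition of functions. This is only an illustrative sketch under stated assumptions, not the implementation used in this work: the function name `enhance_el_speech`, the type aliases, the constants `HOP`, `BNF_DIM`, and `MEL_DIM`, and all four `dummy_*` stages are hypothetical placeholders that stand in for the trained ASR encoder, the two conversion models, and the MelGAN-based vocoder.

```python
from typing import Callable, List

Frames = List[List[float]]  # a sequence of per-frame feature vectors
Waveform = List[float]      # a sequence of audio samples


def enhance_el_speech(
    el_wave: Waveform,
    extract_bnf: Callable[[Waveform], Frames],  # ASR bottleneck extractor
    map_bnf: Callable[[Frames], Frames],        # EL BNF -> normal BNF
    bnf_to_mel: Callable[[Frames], Frames],     # normal BNF -> Mel-spectrogram
    vocoder: Callable[[Frames], Waveform],      # Mel-spectrogram -> waveform
) -> Waveform:
    """Chain the four stages in the order given in the abstract."""
    el_bnf = extract_bnf(el_wave)
    normal_bnf = map_bnf(el_bnf)
    mel = bnf_to_mel(normal_bnf)
    return vocoder(mel)


# Hypothetical placeholder stages, chosen only so the sketch runs.
HOP = 4      # assumed samples per frame
BNF_DIM = 3  # assumed bottleneck feature size
MEL_DIM = 2  # assumed number of Mel bins


def dummy_extract_bnf(wave: Waveform) -> Frames:
    # One feature vector per HOP samples: the frame mean, repeated BNF_DIM times.
    n_frames = len(wave) // HOP
    return [
        [sum(wave[i * HOP:(i + 1) * HOP]) / HOP] * BNF_DIM
        for i in range(n_frames)
    ]


def dummy_map_bnf(bnf: Frames) -> Frames:
    # Identity mapping; each output frame depends only on the input,
    # never on previously generated outputs (non-autoregressive).
    return [list(frame) for frame in bnf]


def dummy_bnf_to_mel(bnf: Frames) -> Frames:
    # Keep the first MEL_DIM values of each frame.
    return [frame[:MEL_DIM] for frame in bnf]


def dummy_vocoder(mel: Frames) -> Waveform:
    # Repeat the first value of each frame HOP times.
    return [frame[0] for frame in mel for _ in range(HOP)]


enhanced = enhance_el_speech(
    [1.0] * 8, dummy_extract_bnf, dummy_map_bnf, dummy_bnf_to_mel, dummy_vocoder
)
```

Passing the stages in as callables keeps the sketch independent of any particular model or framework; in a real system each placeholder would be replaced by the corresponding trained network.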