Voice Conversion with SI-DNN and KL Divergence Based Mapping without Parallel Training Data

Publication date: Available online 30 November 2018
Source: Speech Communication
Author(s): Feng-Long Xie, Frank K. Soong, Haifeng Li

Abstract
We propose a Speaker Independent Deep Neural Net (SI-DNN) and Kullback-Leibler Divergence (KLD) based mapping approach to voice conversion without using parallel training data. The acoustic difference between source and target speakers is equalized with the SI-DNN via its estimated output posteriors, which serve as a probabilistic mapping from acoustic input frames to the corresponding symbols in the phonetic space. KLD is chosen as an ideal distortion measure for finding an appropriate mapping from each input frame of the source speaker to a frame of the target speaker. The mapped acoustic segments of the target speaker form the construction bases for voice conversion. Depending on whether word transcriptions of the target speaker's training data are available, the approach can be either supervised or unsupervised. In the supervised mode, where adequate training data can be used to train a conventional statistical parametric TTS of the target speaker, each input frame of the source speaker is converted to its nearest sub-phonemic "senone". In the unsupervised mode, the frame is converted to the nearest clustered phonetic centroid or a raw speech frame, in the minimum-KLD sense. The acoustic trajectory of the converted voice is rendered with the maximum probability trajectory generation algorithm. Both objective and subjective measures used for evaluating voice co...
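The core mapping step described above can be sketched as follows: each source frame is represented by an SI-DNN posterior vector over phonetic symbols, and it is matched to the target-speaker unit (senone or centroid) whose posterior is closest in KL divergence. This is a minimal illustrative sketch, not the authors' implementation; the function names and the small clipping constant `eps` are assumptions for numerical stability.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete posterior distributions.

    Posteriors are clipped away from zero to keep the log finite.
    """
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

def map_frames(source_posteriors, target_posteriors):
    """For each source-frame posterior, return the index of the
    target unit (e.g. senone or phonetic centroid) with minimum KLD."""
    mapping = []
    for p in source_posteriors:
        klds = [kl_divergence(p, q) for q in target_posteriors]
        mapping.append(int(np.argmin(klds)))
    return mapping

# Toy example: one source frame, two candidate target units.
source = np.array([[0.8, 0.1, 0.1]])
targets = np.array([[0.9, 0.05, 0.05],
                    [0.05, 0.9, 0.05]])
print(map_frames(source, targets))  # the first target is closest
```

In the actual system, the mapped target segments would then feed the maximum probability trajectory generation step to render a smooth converted-voice parameter trajectory.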