Enhanced Feature Network for Monaural Singing Voice Separation

Publication date: Available online 19 November 2018
Source: Speech Communication
Author(s): Weitao Yuan, Boxin He, Shengbei Wang, Jianming Wang, Masashi Unoki

Abstract: Deep Recurrent Neural Network (DRNN) based monaural singing voice separation (MSVS) methods have recently obtained impressive separation results. Most DRNN-based methods directly take the magnitude spectra of the mixture signal as the input feature, which has high dimensionality and contains redundant information. DRNN-based models, however, cannot extract effective low-dimensional, non-redundant representations from the magnitude spectra. In this paper, we propose an Enhanced Feature Network (EFN) to extract an effective representation of the magnitude spectra, i.e., the enhanced-feature, for MSVS. The enhanced-feature is generated in two consecutive stages: (i) modeling the local and contextual information explicitly with a Convolutional Neural Network (CNN); (ii) extracting the high-level sequential feature with a Highway Network and a bi-directional Recurrent Neural Network (RNN). In the first stage, the EFN generates an enhanced-sequence consisting of the high-resolution magnitude spectra and their low-dimensional representations, where the low-dimensional part avoids the high cost of spectra decomposition and the high-resolution part mitigates the problem of information loss. In the second stage, the enhanced-sequence is used to extract the enhanced-feature, which is more suitable for MSVS. Experiments on the ...
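The two-stage data flow described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract gives no layer sizes, kernel shapes, activations, or training details, so every function name, dimension, and weight below is a hypothetical placeholder (random, untrained weights), and the convolution is simplified to a single strided 1-D filter along the frequency axis. The sketch only shows how a low-dimensional representation can be concatenated with the full-resolution spectra into an enhanced-sequence, then passed through a highway layer and a bi-directional recurrent pass.

```python
import numpy as np

def conv_downsample(spec, kernel, stride):
    """Stage (i), simplified: strided 1-D convolution along frequency.
    Returns a low-dimensional representation of every frame."""
    n_frames, n_bins = spec.shape
    k = kernel.shape[0]
    n_out = (n_bins - k) // stride + 1
    out = np.empty((n_frames, n_out))
    for j in range(n_out):
        window = spec[:, j * stride : j * stride + k]
        out[:, j] = np.maximum(window @ kernel, 0.0)  # ReLU (assumed)
    return out

def highway(x, w_h, b_h, w_t, b_t):
    """Highway layer: a gate t mixes the transformed input with the raw input."""
    h = np.tanh(x @ w_h + b_h)
    t = 1.0 / (1.0 + np.exp(-(x @ w_t + b_t)))  # sigmoid transform gate
    return t * h + (1.0 - t) * x

def rnn_pass(x, w_in, w_rec, reverse=False):
    """Plain tanh RNN over the time axis, forward or backward."""
    n_frames = x.shape[0]
    state = np.zeros(w_rec.shape[0])
    out = np.empty((n_frames, w_rec.shape[0]))
    order = range(n_frames - 1, -1, -1) if reverse else range(n_frames)
    for t in order:
        state = np.tanh(x[t] @ w_in + state @ w_rec)
        out[t] = state
    return out

def make_params(n_bins, kernel_size, stride, hidden, seed=0):
    """Random placeholder weights; all sizes are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    n_low = (n_bins - kernel_size) // stride + 1
    dim = n_bins + n_low  # width of the enhanced-sequence
    init = lambda *shape: rng.standard_normal(shape) * 0.1
    return {
        "kernel": init(kernel_size), "stride": stride,
        "w_h": init(dim, dim), "b_h": np.zeros(dim),
        "w_t": init(dim, dim), "b_t": np.full(dim, -1.0),
        "w_in_f": init(dim, hidden), "w_rec_f": init(hidden, hidden),
        "w_in_b": init(dim, hidden), "w_rec_b": init(hidden, hidden),
    }

def enhanced_feature(spec, p):
    # Stage (i): low-dimensional part plus the untouched high-resolution spectra.
    low = conv_downsample(spec, p["kernel"], p["stride"])
    enhanced_sequence = np.concatenate([spec, low], axis=1)
    # Stage (ii): highway layer, then forward and backward recurrent passes.
    hw = highway(enhanced_sequence, p["w_h"], p["b_h"], p["w_t"], p["b_t"])
    fwd = rnn_pass(hw, p["w_in_f"], p["w_rec_f"])
    bwd = rnn_pass(hw, p["w_in_b"], p["w_rec_b"], reverse=True)
    return np.concatenate([fwd, bwd], axis=1)
```

For example, with a hypothetical 10-frame, 64-bin magnitude spectrogram and `make_params(n_bins=64, kernel_size=8, stride=4, hidden=16)`, the enhanced-sequence has 64 + 15 = 79 columns per frame and `enhanced_feature` returns an array of shape (10, 32), one forward and one backward hidden state per frame.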