Multi-domain adversarial training of neural network acoustic models for distant speech recognition

Publication date: Available online 3 November 2018
Source: Speech Communication
Author(s): Seyedmahdad Mirsamadi, John H.L. Hansen

Abstract
Building deep neural network acoustic models directly on far-field speech from multiple recording environments with different acoustic properties is an increasingly popular approach to the problem of distant speech recognition. The common approach to building such multi-condition (multi-domain) models is to compile the available data from all environments into a single training set, discarding information about the specific environment to which each utterance belongs. We propose a novel strategy for training neural network acoustic models, based on adversarial training, which makes use of environment labels during training. By adjusting the parameters of the initial layers of the network adversarially with respect to a domain classifier trained to recognize the recording environments, we enforce better invariance to the diversity of recording conditions. We provide a motivating study of the mechanism by which a deep network learns environmental invariance, and discuss relations to existing approaches for improving the robustness of DNN models. The proposed multi-domain adversarial training is evaluated on an end-to-end speech recognition task based on the AMI meeting corpus, achieving a relative character error rate reduction of +3.3% with respect to a conventional multi-condition trained baseline and +25.4...
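The adversarial adjustment described above is commonly realized with a gradient-reversal step: the domain classifier is trained normally, but the gradient it sends back to the shared feature-extractor layers is negated (and scaled by a coefficient, often written λ), so those layers learn features the domain classifier cannot exploit. The sketch below illustrates only this backward-pass sign flip with NumPy; the function name, shapes, and λ value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def grad_reversal_backward(domain_grad, lam):
    """Gradient-reversal sketch: in the backward pass, the gradient of the
    domain-classification loss is multiplied by -lam before it reaches the
    shared feature extractor, pushing those layers toward features that are
    *uninformative* about the recording environment."""
    return -lam * domain_grad

# Toy gradient of the domain loss w.r.t. the shared features
# (values are made up for illustration).
domain_grad = np.array([0.2, -0.5, 0.1])
lam = 1.0

# The feature extractor receives the reversed gradient, so a standard
# gradient-descent update on it *increases* the domain loss, i.e. it
# makes the environments harder to tell apart.
reversed_grad = grad_reversal_backward(domain_grad, lam)
print(reversed_grad)  # [-0.2  0.5 -0.1]
```

In the forward pass the reversal layer is the identity, so the domain classifier itself still trains on the true gradient; only the shared layers see the negated signal.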