Talk Title: Acoustic and Phonetic Information in Speech Processing - Working Towards More Accurate Spoken Language Recognition
Speaker: Dr. Cheung-Chi Leung (Human Language Technology Department, Institute for Infocomm Research, Singapore)
Host: Prof. Lei Xie (Professor, Doctoral Supervisor)
Cheung-Chi Leung received the B.Eng. degree from the Hong Kong University of Science and Technology in 1999, and the M.Phil. and Ph.D. degrees from the Chinese University of Hong Kong in 2001 and 2004, respectively. From 2004 to 2008, he was a Postdoctoral Researcher at the Spoken Language Processing Group, CNRS-LIMSI, Orsay, France, where he worked on new algorithms for speaker recognition and participated in the NIST speaker recognition evaluations. He joined the Institute for Infocomm Research (I2R) in Singapore in 2008, where he is currently a Scientist in the Human Language Technology Department. His current research interests include automatic speech recognition, spoken document retrieval, spoken language recognition, and speaker recognition.
At the beginning of this presentation, I will briefly introduce some of my recent research work in speech processing. The second part of the presentation is about spoken language recognition (SLR), the task of automatically determining the language being spoken in a given utterance.
There have been two widely adopted approaches to SLR, based primarily on phonotactic features and spectral features, respectively. Phonotactic features are derived from the output of a phone recognizer and are usually modeled by n-gram language models or vector space models. Spectral features are computed directly from the acoustic signal and are typically modeled by Gaussian mixture models (GMMs).
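The phonotactic approach can be illustrated with a minimal sketch: train one phone-bigram language model per language on decoded phone sequences, then pick the language whose model best scores a test sequence. The toy two-phone "languages" below are hypothetical illustrations, not real data.

```python
import math
from collections import Counter

def train_bigram_lm(sequences, alpha=1.0):
    """Estimate an add-alpha smoothed phone-bigram model from decoded phone sequences."""
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for seq in sequences:
        bigrams.update(zip(seq, seq[1:]))
        unigrams.update(seq[:-1])
        vocab.update(seq)
    V = len(vocab)
    def logprob(p1, p2):
        # log P(p2 | p1) with add-alpha smoothing over the phone vocabulary
        return math.log((bigrams[(p1, p2)] + alpha) / (unigrams[p1] + alpha * V))
    return logprob

def score(seq, lm):
    """Log-likelihood of a phone sequence under a bigram language model."""
    return sum(lm(a, b) for a, b in zip(seq, seq[1:]))

def recognize(seq, lms):
    """Pick the language whose bigram model gives the highest score."""
    return max(lms, key=lambda lang: score(seq, lms[lang]))

# Toy example: two "languages" with opposite dominant phone bigrams.
lms = {"L1": train_bigram_lm([list("ababab")]),
       "L2": train_bigram_lm([list("bababa")])}
print(recognize(list("abab"), lms))  # scores favor "L1" on this toy data
```

Real systems replace the bigram model with higher-order n-grams or a vector space model over n-gram counts, but the decode-then-score structure is the same.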
Firstly, to deal with the mismatch in phonotactic features between training and test conditions, acoustic model adaptation prior to phone lattice decoding has been proposed. Moreover, combining diversified phonotactic features (e.g., from multiple phone recognizers running in parallel) is common practice in SLR. These observations motivate us to generate different phonotactic features using diversely adapted acoustic models (obtained from independent mean-only and variance-only MLLR transforms) and to investigate language recognition performance when these diversified features are combined in the same way as the outputs of parallel phone recognizers.
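The two adaptation variants mentioned above can be sketched in a few lines: mean-only MLLR maps each Gaussian mean through an affine transform, while variance-only adaptation rescales the (diagonal) covariances. This is only an illustration of the transform arithmetic, assuming diagonal covariances; estimating A, b, and h from adaptation data is the harder part and is omitted here.

```python
import numpy as np

def mllr_adapt_means(means, A, b):
    """Mean-only MLLR: each Gaussian mean mu is mapped to A @ mu + b.

    `means` is (M, D) for M mixture components of dimension D;
    A is (D, D) and b is (D,).
    """
    return means @ A.T + b

def adapt_variances(variances, h):
    """Variance-only adaptation (diagonal case): scale each variance by h.

    `variances` is (M, D) of diagonal covariance entries; h is (D,).
    """
    return variances * h

# Applying the two transforms independently to the same base model yields
# two diversely adapted acoustic models from which different phone lattices
# (and hence different phonotactic features) can be decoded.
base_means = np.array([[1.0, 2.0], [3.0, 4.0]])
base_vars = np.array([[1.0, 4.0], [2.0, 3.0]])
model_a_means = mllr_adapt_means(base_means, np.eye(2), np.array([0.5, -0.5]))
model_b_vars = adapt_variances(base_vars, np.array([2.0, 0.5]))
```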
Secondly, the use of shifted-delta multi-layer perceptron (SDMLP) features in an SLR system is investigated. The most commonly used spectral features in SLR systems are shifted-delta cepstral (SDC) features. Delta coefficients generally refer to the time derivatives of static coefficients computed over successive frames. Like cepstral features, SDC features are sensitive to speaker and environmental variations. In automatic speech recognition, transforming cepstral features into posterior probability features with a multi-layer perceptron (MLP) has been proposed, and these MLP features have been found to be more robust than cepstral features. SDMLP features are obtained by applying the shifted-delta operation to the MLP features produced by an MLP-based phone recognizer. The resulting features can be incorporated straightforwardly into state-of-the-art SLR systems in the same way as conventional SDC features. The proposed features are expected to combine the robustness of MLP features with the benefit of the long time span captured by the shifted-delta operation.
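The shifted-delta operation itself is simple to state: with parameters N-d-P-k, each output frame stacks k delta blocks taken at shifts of P frames, where each block is the difference of frames d apart. The commonly cited 7-1-3-7 configuration is used as the default in this sketch; applied to MLP posterior features instead of cepstra, the same operation yields the SDMLP features described above.

```python
import numpy as np

def sdc(features, d=1, P=3, k=7):
    """Shifted-delta features from a (T, N) matrix of frame-level features.

    For each frame t, stack k delta blocks:
        delta_i(t) = c(t + i*P + d) - c(t + i*P - d),  i = 0..k-1
    Frames whose delta window runs past either end are dropped, so the
    output has shape (T - (k-1)*P - 2*d, N*k).
    """
    T, N = features.shape
    start = d                        # first frame with t - d >= 0
    last = T - ((k - 1) * P + d)     # last start frame with a full window
    out = []
    for t in range(start, last):
        blocks = [features[t + i * P + d] - features[t + i * P - d]
                  for i in range(k)]
        out.append(np.concatenate(blocks))
    return np.array(out)
```

For a 7-dimensional input of 30 frames with the 7-1-3-7 configuration, this produces 10 frames of 49-dimensional shifted-delta features.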
Lastly, the use of prosodic features in SLR is introduced. This important component of human speech has received relatively little attention in SLR. Prosody refers to the rhythmic and intonational characteristics of speech, which are observed over a relatively long time span. In contrast with several previous studies that focused only on specific types of prosodic features, we investigate the use of many prosodic features extracted with different measurement and normalization methods, referred to as prosodic attributes.
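To make the notion of prosodic attributes concrete, the sketch below derives a few example attributes from frame-level F0 and energy contours. The attribute set and normalizations here are illustrative assumptions, not the ones studied in the talk; the point is that one measurement (e.g., F0) can yield several attributes under different normalizations.

```python
import numpy as np

def prosodic_attributes(f0, energy):
    """A few illustrative prosodic attributes from frame-level contours.

    `f0` is F0 in Hz per frame (0 marks unvoiced frames); `energy` is a
    per-frame energy contour of the same utterance.
    """
    voiced = f0 > 0
    f0v = f0[voiced]
    t = np.arange(len(energy))
    return {
        "f0_mean_hz": float(f0v.mean()),                               # raw pitch level
        "f0_range_semitones": float(12 * np.log2(f0v.max() / f0v.min())),  # log-scale range
        "voiced_ratio": float(voiced.mean()),                          # rhythm-related
        "energy_slope": float(np.polyfit(t, energy, 1)[0]),            # intonation/loudness trend
    }
```

Attributes like these, being computed over whole segments, capture the long-time-span character of prosody that frame-level spectral features miss.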