Low-dimensional, auditory feature vectors that improve vocal-tract-length normalization in automatic speech recognition

J. J. M. Monaghan*, C. Feldbauer, T. C. Walters, R. D. Patterson

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

The syllables of speech contain information about the vocal tract length (VTL) of the speaker as well as the glottal pulse rate (GPR) and the syllable type. Ideally, the pre-processor for automatic speech recognition (ASR) should segregate syllable-type information from VTL and GPR information. The auditory system appears to perform this segregation, and this may be why human speech recognition (HSR) is so much more robust than ASR. This paper compares the robustness of recognizers based on two types of feature vectors: mel-frequency cepstral coefficients (MFCCs), the traditional feature vectors of ASR, and a new form of feature vector inspired by the neural patterns produced by speech sounds in the auditory system. The speech stimuli were syllables scaled to have a wide range of values of VTL and GPR. For both recognizers, training took place with stimuli from a small central range of scaled values. Average performance for MFCC-based recognition over the full range of scaled syllables was just 73.5%, with performance falling to 4% for syllables with extreme VTL values. The bio-acoustically motivated feature vectors led to much better performance; the average for the full range of scaled syllables was 90.7%, and performance never fell below 65%.
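To make the comparison concrete, the following is a minimal sketch of the MFCC pipeline that the baseline recognizer's feature vectors are built on: frame the waveform, take the power spectrum, apply a triangular mel filterbank, take logs, and decorrelate with a DCT. This is a generic illustration with assumed parameter values (16 kHz sampling, 25 ms frames, 26 filters, 13 coefficients), not the exact configuration used in the paper, and the white-noise input stands in for real syllable recordings.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_ceps=13):
    # Frame the signal, apply a Hamming window, take the power spectrum
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Mel filterbank energies, log compression, then DCT to decorrelate
    fb = mel_filterbank(n_filters, n_fft, sr)
    log_energies = np.log(power @ fb.T + 1e-10)
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :n_ceps]

# Example: one second of synthetic input at 16 kHz
rng = np.random.default_rng(0)
feats = mfcc(rng.standard_normal(16000))
print(feats.shape)  # one 13-dimensional vector per 10 ms frame
```

Because MFCCs retain spectral detail that co-varies with vocal tract length, a recognizer trained on these vectors for a narrow VTL range generalizes poorly to scaled syllables, which is the failure mode the auditory-inspired features are designed to avoid.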

Original language: English
Pages (from-to): 477-482
Number of pages: 6
Journal: Proceedings - European Conference on Noise Control
Publication status: Published - 2008
Externally published: Yes
