The syllables of speech contain information about the vocal tract length (VTL) of the speaker as well as the glottal pulse rate (GPR) and the syllable type. Ideally, the pre-processor for automatic speech recognition (ASR) should segregate syllable-type information from VTL and GPR information. The auditory system appears to perform this segregation, and this may be why human speech recognition (HSR) is so much more robust than ASR. This paper compares the robustness of recognizers based on two types of feature vectors: mel-frequency cepstral coefficients (MFCCs), the traditional feature vectors of ASR, and a new form of feature vector inspired by the neural patterns produced by speech sounds in the auditory system. The speech stimuli were syllables scaled to have a wide range of values of VTL and GPR. For both recognizers, training took place with stimuli from a small central range of scaled values. Average performance for MFCC-based recognition over the full range of scaled syllables was just 73.5%, with performance falling to 4% for syllables with extreme VTL values. The bio-acoustically motivated feature vectors led to much better performance; the average for the full range of scaled syllables was 90.7%, and performance never fell below 65%.
Proceedings - European Conference on Noise Control
Published - 2008