May 1, 2013

Using multiple versions of speech input in phone recognition

Citation

M. Liberman, J. Yuan, A. Stolcke, W. Wang and V. Mitra, “Using multiple versions of speech input in phone recognition,” in Proc. 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2013, pp. 7591–7595.

Abstract

This study investigates the use of multiple versions of the same speech unit in automatic phone recognition. Two methods were applied to combine multiple versions of an utterance in decoding: cross forced-alignment and n-best ROVER. The phone error rate was reduced from 15% to 2% on isolated words and from 33% to 19% on TIMIT sentences. The largest reduction came from adding the second version, with diminishing gains from each additional version. Depending on the language model weight, it may be better to use the language model only in n-best generation and omit it when scoring the hypotheses passed to the combination methods. The effectiveness of n-best ROVER may be enhanced by lowering the language model weight.
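
To give a rough sense of the voting idea behind ROVER-style combination, the following Python sketch takes phone hypotheses from several versions of the same word and picks the most frequent phone in each slot. It is not the authors' implementation and rests on strong assumptions: the hypotheses are taken to be already aligned to equal length, every vote counts equally, and the `NULL` marker, the `vote` function, and the example phone sequences are hypothetical names and data chosen only for illustration. Full ROVER builds the alignment by dynamic programming and can weight votes by confidence.

```python
from collections import Counter

# Hypothetical placeholder meaning "no phone in this slot" (an assumption of
# this sketch, not notation from the paper).
NULL = "-"


def vote(aligned_hyps):
    """Majority vote per slot over pre-aligned, equal-length phone sequences.

    Ties are resolved in favor of the phone seen first in the slot.
    """
    if not aligned_hyps:
        return []
    length = len(aligned_hyps[0])
    if any(len(h) != length for h in aligned_hyps):
        raise ValueError("hypotheses must be pre-aligned to equal length")

    result = []
    for slot in zip(*aligned_hyps):
        # Count how many hypotheses propose each phone in this slot.
        winner, _ = Counter(slot).most_common(1)[0]
        # A winning NULL means most hypotheses have no phone here.
        if winner != NULL:
            result.append(winner)
    return result


# Illustrative (made-up) top hypotheses from three versions of one word.
hyps = [
    ["k", "ae", "t"],
    ["k", "eh", "t"],
    ["g", "ae", "t"],
]
combined = vote(hyps)
```

In this toy case each version makes at most one error, and the errors fall in different slots, so the vote recovers a sequence that no single erroneous version would have produced on its own.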
