September 1, 2002

Is the Speaker Done Yet? Faster and More Accurate End-of-Utterance Detection Using Prosody in Human-Computer Dialog

Citation

Ferrer, L., Shriberg, E., & Stolcke, A. (2002). Is the speaker done yet? Faster and more accurate end-of-utterance detection using prosody. In Seventh international conference on spoken language processing.

Abstract

We examine the problem of end-of-utterance (EOU) detection for real-time speech recognition, particularly in the context of a human-computer dialog system. Current EOU detection algorithms use only a simple pause threshold for making this decision, leading to two problems. First, users often pause inside utterances, resulting in a premature cut off by the system. Second, when users really are done, the minimum system wait is always the threshold value, needlessly adding time to the interaction. We have developed a new approach to EOU detection that uses prosodic features to address both of these problems. Prosodic features are modeled by decision trees and combined with an event N-gram language model to obtain a score that measures the likelihood that any nonspeech region is an EOU. We find that this approach dramatically improves both the accuracy and speed of online EOU detection.

↓ Download

Is the Speaker Done Yet? Faster and More Accurate End-of-Utterance Detection Using Prosody in Human-Computer Dialog

Abstract

Read more from SRI

Recognizing the life and work of Mary Wagner

Robots in the cleanroom

SRI research aims to make generative AI more trustworthy