75 Years of Innovation: Speech Recognition

Natural and automated speech recognition for wide-scale commercial application.

The 75 Years of Innovation series highlights the groundbreaking innovations spanning from SRI’s founding in 1946 to today. Each week, SRI will release an innovation, leading up to its 75th anniversary in November 2021.

A Word in Your I/O Port: How Computers Were Made to DECIPHER Natural Human Speech

“What the Nuance speech system is actually doing is taking each sound apart and analyzing it to figure out exactly what the caller said…” Bob Morgen, SRI International, 1996

From a Computer World interview on Charles Schwab & Co.’s use of speech recognition in its VoiceBroker system.

Many creatures on planet Earth communicate, but human beings have made speech central to daily interactions. Unlike the communication of many of the planet’s fauna, human speech is made up of complex sentences governed by grammatical rules that define context; this context places events in time and space. Add natural language ‘nuances’ such as regional accents, and the complexity increases. Whilst these foibles add great beauty and depth to human language, they also make it difficult to use natural language when communicating with computers.

For many years, researchers have been trying to translate human language into computer language, bringing the two worlds together; human speech is the ultimate in Human-Computer Interaction (HCI). Whilst there have been many attempts at connecting computers with human beings using the spoken word, here at SRI we created a system known as DECIPHER, the basis of which is now used in commercial settings to make communicating with computers via voice a more natural and seamless experience.

This is the part that SRI played in allowing computers and humans to talk to one another.

The Technology Behind Natural Language Speech Recognition

Accents, in the context of speech recognition, can be frustrating for the user and challenging for the developer. Whilst there have been numerous attempts at employing speech recognition in Human-Computer Interaction, most have failed the accent test. SRI took the concepts behind natural language and speech and developed the DECIPHER project to meet this challenge head-on.

In 1989, the “DARPA Speech and Natural Language Workshop” was used as a platform to describe the technology behind DECIPHER. The SRI team who worked on the DECIPHER project explored ways to integrate speech and linguistic knowledge using the HMM (Hidden Markov Model) framework.

A Markov model describes a sequence of random variables in which the prediction of the next state depends only on the current state. In practice, however, the states that make up the chain may be hidden. In text, for example, these are often the ‘part-of-speech’ tags: we see the words, but the tags are hidden. A Hidden Markov Model (HMM) relates the observed and hidden parts of the sequence in a single probabilistic model, which an algorithm in a speech recognition program can use to infer the hidden parts from what is observed. In the 1980s, the HMM approach used by DECIPHER revolutionized speech recognition, allowing computers to determine, to a high degree of accuracy, the probability that a sound was in fact a given word.
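To make the part-of-speech example above concrete, here is a minimal sketch of how an HMM can be decoded with the Viterbi algorithm, a standard way of finding the most likely hidden sequence. It is an illustration only, not DECIPHER's implementation: the two tags, the three-word vocabulary, and every probability are invented for the example.

```python
# Toy HMM: the words are observed, the part-of-speech tags are hidden.
# All numbers below are made-up assumptions, not values from any SRI system.
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.6, "VERB": 0.4}            # P(first tag)
trans_p = {                                      # P(next tag | current tag)
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.8, "VERB": 0.2},
}
emit_p = {                                       # P(word | tag)
    "NOUN": {"dogs": 0.45, "chase": 0.10, "cats": 0.45},
    "VERB": {"dogs": 0.10, "chase": 0.80, "cats": 0.10},
}


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most likely hidden tag sequence, its joint probability)."""
    # best[t][s]: probability of the best path that ends in state s at step t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    # back[t][s]: the previous state on that best path
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] * trans_p[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score * emit_p[s][observations[t]]
            back[t][s] = prev
    # Trace the best path backwards from the most probable final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


path, prob = viterbi(["dogs", "chase", "cats"], states, start_p, trans_p, emit_p)
print(path, round(prob, 6))  # ['NOUN', 'VERB', 'NOUN'] 0.054432
```

In a speech recognizer the same idea applies with different ingredients: the observations are acoustic measurements rather than words, and the hidden states correspond to units of speech.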

The work at SRI International, which explored the use of HMM frameworks for natural language speech recognition, moved the technology on towards a more commercially viable use of speech recognition. This research has since been the foundation for a number of developments.

The Place of Natural Language Speech Recognition in the History of Technology

The Speech Technology and Research (STAR) Laboratory at SRI International began the journey that eventually resulted in a spin-off company, Corona Corporation (later renamed Nuance Communications). Nuance focused on commercializing advanced speech recognition technologies.

In 1995, the SRI Language Modeling Toolkit (SRILM) was developed. It provides the tools to build and apply statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation, and machine translation.
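As a rough illustration of what a statistical language model is, the sketch below estimates bigram probabilities (the chance of a word given the word before it) from a tiny corpus. It does not use SRILM or mirror its interface: the function names, the two-sentence corpus, and the choice of add-one smoothing are all assumptions made for this example.

```python
from collections import Counter


def train_bigram(sentences):
    """Count context words and word pairs, with sentence boundary markers."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for sentence in sentences:
        words = sentence.split()
        vocab.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return contexts, bigrams, vocab


def bigram_prob(prev, word, contexts, bigrams, vocab):
    """P(word | prev) with add-one smoothing, so unseen pairs are not zero."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))


# Hypothetical corpus of phone requests, invented for this example.
corpus = ["get a quote", "get a price"]
contexts, bigrams, vocab = train_bigram(corpus)

print(round(bigram_prob("get", "a", contexts, bigrams, vocab), 3))  # 0.429
print(round(bigram_prob("a", "get", contexts, bigrams, vocab), 3))  # 0.143
```

A recognizer can use scores like these to prefer word sequences that are likely in the language ("get a quote") over ones that merely sound similar.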

On the commercial side, SRI’s natural language speech recognition software was the first to be deployed by a major corporation. In 1996, Charles Schwab & Co., Inc. used Nuance’s speech recognition technology to allow customers to receive stock quotes over the telephone. One of the key features of the ‘Schwab Discount Brokerage system’ was the ability to recognize English words even when spoken by customers with accents.

In 1997, Nuance Communications developed the first large-scale commercial dialog system for United Parcel Service (UPS). UPS used the voice recognition platform to handle very large numbers of inquiries about package status.

In 2006, Nuance used the “The Amazing Race: Mobile Text Messaging” challenge to pit its speech recognition technology against the world’s fastest texter: the texter took over 42 seconds, whilst Nuance Mobile Dictation took 16 seconds.

Some of the most recent speech recognition technologies to be born from SRI International are:

● EduSpeak is used in foreign language teaching and in corporate training and simulation. It can compare a language learner’s pronunciation with that of a native speaker. The system uses a speaker-independent speech recognition engine designed for developers of interactive, multimedia learning products who want to integrate voice input into their products.

● DynaSpeak is a small-footprint, high-accuracy, speaker-independent speech recognition engine that scales from embedded to large-scale system use in industrial, consumer, and military products and systems.

As human beings, we love to talk. SRI and its spin-off Nuance have built the technology that allows our natural, complex speech, even with accents, to be used to communicate with computers.

Now, the world of computing is no longer silent but filled with chatter.