Automatic Speech Recognition Techniques
Today, speech recognition research is interdisciplinary, drawing upon work in fields as diverse as biology, computer science, electrical engineering, linguistics, mathematics, physics, and psychology. Within these disciplines, pertinent work is being done in the areas of acoustics, artificial intelligence, computer algorithms, information theory, linear algebra, linear system theory, pattern recognition, phonetics, physiology, probability theory, signal processing, and syntactic theory.
Speech recognition systems are generally classified as discrete or continuous systems that are speaker-dependent, speaker-independent, or speaker-adaptive. Discrete systems maintain a separate acoustic model for each word, combination of words, or phrase, and are referred to as isolated (word) speech recognition (ISR) systems. Continuous speech recognition (CSR) systems, on the other hand, respond to a user who pronounces words, phrases, or sentences that are in a series or specific order and are dependent on each other, as if linked together.
A speaker-dependent system requires that the user record an example of the word, sentence, or phrase before the system can recognize it; that is, the user "trains" the system. Some speaker-dependent systems require only that the user record a subset of the system vocabulary to make the entire vocabulary recognizable. A speaker-independent system does not require any recording prior to system use; it is developed to operate for any speaker of a particular type (e.g., American English). A speaker-adaptive system is developed to adapt its operation to the characteristics of new speakers.
ISR systems present a considerably easier task for machines than do CSR systems. Speaker-dependent systems are simpler to construct and use and are more accurate than speaker-independent systems. As a result, the focus of early voice recognition systems was primarily speaker-dependent isolated word systems that used limited vocabulary. At the time, overcoming the restrictions in the state of technology required a greater focus on human-to-computer interaction. The challenge was to identify how improved speech recognition technology could be used to support the enhancement of human interaction with machines.
Most modern speech recognition uses probabilistic models to interpret a sequence of sounds. Hidden Markov models, in particular, are used to recognize words. To increase word accuracy in speech recognition, language models are used to capture the information that certain word combinations are more likely than others, thus improving detection based on context.
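The role of a language model can be illustrated with a minimal bigram sketch. The toy corpus, the function names, and the add-alpha smoothing are assumptions chosen for illustration; they are not taken from any particular recognizer.

```python
from collections import Counter, defaultdict

# Toy training corpus (hypothetical; a real system would use far more text).
CORPUS = [
    "recognize speech",
    "recognize speech today",
    "wreck a nice beach",
]

def train_bigrams(sentences):
    """Count word pairs, with <s> marking the start of each sentence."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        for prev, cur in zip(words, words[1:]):
            counts[prev][cur] += 1
    return counts

def bigram_prob(counts, prev, cur, vocab_size, alpha=1.0):
    """Add-alpha smoothed estimate of P(cur | prev)."""
    total = sum(counts[prev].values())
    return (counts[prev][cur] + alpha) / (total + alpha * vocab_size)

def sentence_score(counts, sentence, vocab_size):
    """Product of bigram probabilities over the whole word sequence."""
    words = ["<s>"] + sentence.split()
    score = 1.0
    for prev, cur in zip(words, words[1:]):
        score *= bigram_prob(counts, prev, cur, vocab_size)
    return score

COUNTS = train_bigrams(CORPUS)
VOCAB = {word for sentence in CORPUS for word in sentence.split()}

# A word sequence seen in training scores higher than an unseen combination.
likely = sentence_score(COUNTS, "recognize speech", len(VOCAB))
unlikely = sentence_score(COUNTS, "recognize beach", len(VOCAB))
```

In a recognizer, a score of this kind would typically be combined with the acoustic score, so that acoustically similar hypotheses are ranked by how plausible their word sequences are.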
Automatic speech recognition performs poorly in noise, especially with crosstalk from other speakers. Humans are very tolerant of noisy environments, but automated speech recognition degrades rapidly as noise increases. Signal corruption from background speech in multiple-speaker environments is particularly troublesome. Biologically inspired neural networks show promise for noise-tolerant spoken-language interfaces in such situations.
An important element in the creation of a speech recognition system is the size of the vocabulary. The vocabulary of a speech recognition system affects the complexity, the processing requirements, and the accuracy of the system. It is much easier to look up the definition of one of 20 words in a 20-word dictionary than one of hundreds of thousands of words in a Webster's dictionary. That is essentially what the speech recognition software is doing: accessing a dictionary of phonemes and words. [A phoneme, although difficult to describe, is basically the smallest unit of speech sound that distinguishes one word from another. Every word can be broken down into the individual sounds that make it up. Each of these units is a phoneme.]
Another important factor in the complexity of a speech recognition system is the type of speech that the system accepts: discrete or continuous. In a discrete speech system, the operator must pause between words, which makes the speech recognition task much easier. This is the simplest form of recognition to perform, because the end points of words are easier to find, and the pronunciation of one word tends not to affect the others. Because each word is pronounced more consistently from one occurrence to the next, it is easier to recognize.
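When words are separated by pauses, their end points can be located with something as simple as an energy threshold. The following is a minimal sketch under that assumption; the frame length, the threshold, and the synthetic signal are illustrative values, not parameters from any real system.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def find_words(samples, frame_len=4, threshold=0.01):
    """Return (start, end) sample indices of runs of frames above threshold."""
    segments, start = [], None
    for idx, energy in enumerate(frame_energies(samples, frame_len)):
        if energy > threshold and start is None:
            start = idx * frame_len                    # word begins
        elif energy <= threshold and start is not None:
            segments.append((start, idx * frame_len))  # word ends at a pause
            start = None
    if start is not None:
        # Signal ended while still inside a word.
        segments.append((start, (len(samples) // frame_len) * frame_len))
    return segments

# Synthetic signal: silence, a "word", silence, another "word", silence.
signal = [0.0] * 8 + [0.5, -0.5] * 4 + [0.0] * 8 + [0.3, -0.3] * 6 + [0.0] * 4
segments = find_words(signal)
```

With no pause between the two bursts, the same function would report a single segment, which is one way to see why word boundaries are harder to find in connected speech.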
A continuous speech system operates on speech in which words are connected together, i.e., not separated by pauses. Continuous speech is more difficult to handle because of a variety of effects. First, it is difficult to find the start and end points of words. Another problem is "coarticulation." The production of each phoneme is affected by the production of surrounding phonemes, and similarly the starts and ends of words are affected by the preceding and following words. The recognition of continuous speech is also affected by the rate of speech (fast speech tends to be harder).
A dictionary specifies, for every word that may be used in the network, the permitted sequence of acoustic models for its individual speech sounds. Note that a dictionary may contain multiple pronunciations of the same word. There are usually two types of dictionaries, a system dictionary and a user dictionary. The system dictionary contains a non-modifiable list of words that the development software will recognize. It is sometimes possible to select a subset of the full system dictionary, which will be faster to load for the speech recognition process. User dictionaries are usually created from scratch by the user, but can also be subsets of the system dictionary.
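A pronunciation dictionary of this kind can be pictured as a mapping from each word to one or more phoneme sequences. The sketch below is illustrative only: the words, the ARPAbet-style phoneme symbols, and the helper names `select_subset` and `pronunciations` are assumptions, not part of any specific toolkit.

```python
# Hypothetical system dictionary: word -> list of pronunciations,
# each pronunciation being a sequence of phoneme symbols.
SYSTEM_DICT = {
    "tomato": [
        ["T", "AH", "M", "EY", "T", "OW"],
        ["T", "AH", "M", "AA", "T", "OW"],
    ],
    "yes": [["Y", "EH", "S"]],
    "no": [["N", "OW"]],
    "stop": [["S", "T", "AA", "P"]],
}

def select_subset(dictionary, words):
    """Build a smaller dictionary limited to the requested words."""
    missing = [w for w in words if w not in dictionary]
    if missing:
        raise KeyError(f"not in system dictionary: {missing}")
    return {w: dictionary[w] for w in words}

def pronunciations(dictionary, word):
    """All known pronunciations of a word (empty list if the word is unknown)."""
    return dictionary.get(word, [])

# A user dictionary created as a subset of the system dictionary.
user_dict = select_subset(SYSTEM_DICT, ["yes", "no"])
```

Keeping the subset small mirrors the point made above: fewer entries mean less to load and fewer candidates to search during recognition.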
The sets of acoustic models can be "trained" on speech recorded from multiple users. These model sets take into account variations among individual speakers, such as pronunciation (dialect) and accent. It is important to train the model sets in an environment similar to the one that will be used in the recognition process; for example, do not train a model set using a headset microphone if the recognition environment is going to involve access via telephone.
Very simply, a speech recognition process involves processing raw acoustic data through a recognizer, which matches the acoustic data with a set of acoustic models using a decoder to generate a recognition hypothesis.
The speech recognition process begins with the digital sampling of the verbalized input of the user. This input might be from a source gathered offline and stored on a disk, or directly from a real-time source of sampled data such as a workstation's audio input.
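As a rough illustration of reading sampled input that was stored offline, the sketch below writes a short synthetic tone as 16-bit mono WAV data and reads it back using Python's standard `wave` module. The tone, the 8000 Hz sample rate, and the helper names are assumptions standing in for real recorded speech.

```python
import io
import math
import struct
import wave

def write_tone(stream, rate=8000, freq=440.0, seconds=0.01):
    """Write a short 16-bit mono sine tone as WAV data (stand-in for speech)."""
    n = round(rate * seconds)
    values = [
        int(32767 * 0.5 * math.sin(2 * math.pi * freq * i / rate))
        for i in range(n)
    ]
    with wave.open(stream, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)       # 2 bytes = 16 bits per sample
        w.setframerate(rate)
        w.writeframes(struct.pack(f"<{n}h", *values))

def read_samples(stream):
    """Return (sample_rate, samples scaled to [-1, 1)) from 16-bit mono WAV data."""
    with wave.open(stream, "rb") as w:
        if w.getnchannels() != 1 or w.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono audio")
        rate = w.getframerate()
        raw = w.readframes(w.getnframes())
    ints = struct.unpack(f"<{len(raw) // 2}h", raw)
    return rate, [x / 32768.0 for x in ints]

# An in-memory buffer stands in for a file on disk.
buf = io.BytesIO()
write_tone(buf)
buf.seek(0)
rate, samples = read_samples(buf)
```

A real-time source would deliver the same kind of sample sequence in small blocks rather than all at once.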
The next stage is acoustic signal processing, where the digitized verbal input is split into a series of discrete "observations." The hope here is that these observations are a faithful representation of the verbalized input from the user. Most techniques include spectral analysis; e.g., Linear Predictive Coding (LPC) analysis, Mel Frequency Cepstral Coefficients (MFCC), cochlea modeling, and others.
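The splitting of the signal into observations can be sketched as framing followed by a simple spectral analysis. The example below uses a Hamming window and a naive discrete Fourier transform purely for illustration; it is not LPC or MFCC analysis, and the frame length, hop size, and function names are assumptions.

```python
import cmath
import math

def split_frames(samples, frame_len, hop):
    """Cut the signal into overlapping frames of frame_len samples."""
    return [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, hop)
    ]

def hamming(frame):
    """Taper the frame edges to reduce spectral leakage."""
    n = len(frame)
    return [
        x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
        for i, x in enumerate(frame)
    ]

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..n/2 (O(n^2), for illustration only)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]

def observations(samples, frame_len=32, hop=16):
    """One spectral feature vector (an observation) per frame."""
    return [
        magnitude_spectrum(hamming(frame))
        for frame in split_frames(samples, frame_len, hop)
    ]

# Test tone with 4 cycles per 32-sample frame, so energy should peak at bin 4.
tone = [math.sin(2 * math.pi * 4 * i / 32) for i in range(64)]
obs = observations(tone)
peak_bin = obs[0].index(max(obs[0]))
```

Practical front ends usually go further, for instance by warping the spectrum to a perceptual frequency scale before producing the final feature vector.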
An attempt is then made to match the discrete observations with a known set of acoustic models. Each model at its core represents a phoneme. A set of models is combined into a word or phrase using a dictionary. The dictionary specifies the pronunciation of each word as a set of phonemes. This step can be accomplished by the use of a number of different processes such as Hidden Markov Models (HMM), Dynamic Time Warping (DTW), Neural Networks (NNs), expert systems, as well as combinations of these techniques. HMM-based systems are currently the most commonly used and most successful approach.
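Of the matching techniques listed, Dynamic Time Warping is the easiest to show in a few lines. The sketch below aligns an observed sequence against stored word templates; the one-dimensional "features" and the two templates are made-up values for illustration, whereas real systems compare multi-dimensional feature vectors.

```python
def dtw_distance(a, b):
    """Dynamic time warping cost between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]

# Hypothetical stored templates, one per vocabulary word.
TEMPLATES = {
    "yes": [1, 3, 4, 3, 1],
    "no": [4, 4, 1, 1, 1],
}

def recognize(observed):
    """Pick the template with the smallest warped distance."""
    return min(TEMPLATES, key=lambda word: dtw_distance(observed, TEMPLATES[word]))

# The same contour as "yes", spoken at half speed.
slow_yes = [1, 1, 3, 3, 4, 4, 3, 3, 1, 1]
best_word = recognize(slow_yes)
```

Because the alignment may repeat elements of either sequence, the slower utterance still matches its template exactly, which is the property that makes DTW tolerant of speaking rate.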
During the recognition phase, the existing, trained acoustic models are compared with the processed voice input (discrete observations). A decoder (e.g., one based on the Viterbi algorithm) is used to match the voice input with the most likely acoustic models as a path is traced through the network. (The Baum-Welch algorithm, often mentioned alongside Viterbi, is used to train HMM parameters rather than to decode.) The decoder transcribes the continuous speech input into a sequence of textual symbols which an application can directly process. The goal is to match up the symbols into recognizable groups by comparing them with the acoustic speech models.
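Viterbi decoding, which finds the single most likely path through the network, can be sketched on a tiny two-state HMM. The states, the observation symbols, and all of the probabilities below are invented for illustration and do not come from a trained model.

```python
import math

# Hypothetical two-state HMM with discrete observation symbols.
STATES = ["s", "iy"]
START = {"s": 0.8, "iy": 0.2}
TRANS = {"s": {"s": 0.6, "iy": 0.4}, "iy": {"s": 0.1, "iy": 0.9}}
EMIT = {"s": {"hiss": 0.9, "tone": 0.1}, "iy": {"hiss": 0.2, "tone": 0.8}}

def viterbi(observations):
    """Return the most likely state sequence and its log probability."""
    # best[t][state] = (log prob of best path ending in state at time t,
    #                   previous state on that path)
    first = observations[0]
    best = [{
        s: (math.log(START[s]) + math.log(EMIT[s][first]), None)
        for s in STATES
    }]
    for obs in observations[1:]:
        column = {}
        for s in STATES:
            prev, score = max(
                ((p, best[-1][p][0] + math.log(TRANS[p][s])) for p in STATES),
                key=lambda item: item[1],
            )
            column[s] = (score + math.log(EMIT[s][obs]), prev)
        best.append(column)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[-1][s][0])
    log_prob = best[-1][state][0]
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1], log_prob

path, log_prob = viterbi(["hiss", "hiss", "tone", "tone"])
```

Working in log probabilities avoids numerical underflow on long utterances, where the product of many small probabilities would otherwise round to zero.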
The end product of this phase is the speech recognition process's best approximation of the verbal input, in a form that can be passed to an application through an Application Programming Interface (API). It is through the API that the decoded verbal input from the speech recognition process is used to perform an action in response to the original verbal input.