Automatic Speech Recognition (ASR)
Speech / Voice Recognition
Speech recognition (also referred to as voice recognition) is a process by which the elements of spoken language can be recognized and analyzed, and the linguistic message it contains transposed into a meaningful form so that a machine can respond correctly to spoken commands. Voice recognition is distinct from voice identification, which is the capability to identify a specific individual by comparing unknown recorded voices to known voice exemplars to identify similar and dissimilar characteristics.
The "holy grail" of ASR research is to allow a computer to recognize in real-time with 100% accuracy all words that are intelligibly spoken by any person, independent of vocabulary size, noise, speaker characteristics and accent, or channel conditions. Despite several decades of research in this area, accuracy greater than 90% is only attained in commercial systems when the task is constrained in some way.
Different levels of performance can be attained by unclassified systems. Recognition accuracy for continuous digits over a microphone channel (small vocabulary, no noise) can be greater than 99%. If the system is trained to learn an individual speaker's voice, then much larger vocabularies are possible, although accuracy drops to somewhere between 90% and 95% for commercially available systems. For large-vocabulary speech recognition of different speakers over different channels, accuracy in commercial systems is less than 90%, and processing can take hundreds of times real-time.
Automatic Speech Recognition History
The earliest attempts to devise systems for automatic speech recognition by machine were made in the 1950s. Much of the early research leading to the development of speech activation and recognition technology was funded by NSA, NSF and the Defense Department's DARPA. Much of the initial research, performed with NSA and NSF funding, was conducted in the 1980s.
Kurzweil was founded in 1982 and proposed to use its experience, industry knowledge, and market presence to leverage the production of the interface. In 1985, the company introduced the Kurzweil Voice System, the first 1,000-word discrete-speech recognizer. This interface, adaptable to many applications, allowed the user to control the application by voice without modifying the operating system or software. In 1987, Kurzweil introduced the first 20,000-word discrete-speech recognizer, which was incorporated into Kurzweil Voice Report software and allowed users to create structured reports by voice.
Speech recognition technology was designed initially for individuals in the disability community. For example, voice recognition can help people with musculoskeletal disabilities caused by multiple sclerosis, cerebral palsy, or arthritis achieve maximum productivity on computers.
During the early 1990s, tremendous market opportunities emerged for speech recognition computer technology, yet no company had been able to develop a low-cost commercial system that could recognize natural-language continuous speech commands. Development of this type of technology presented too high a level of scientific risk to attract private investment. Therefore, in 1994, Kurzweil Applied Intelligence, Inc., applied for and was awarded cost-shared funding from the NIST Advanced Technology Program (ATP) to pursue a three-year development project. With the help of ATP funding, Kurzweil successfully developed fully operational continuous dictation technology.
The early versions of these products were clunky and hard to use. The early language-recognition systems had to make compromises: they were "tuned" to be dependent on a particular speaker, or had small vocabulary, or used a very stylized and rigid syntax. However, in the computer industry nothing stays the same for very long and by the end of the 1990s there was a whole new crop of commercial speech recognition software packages that were easier to use and more effective than their predecessors.
In July 1997, Lernout & Hauspie acquired Kurzweil. That same year, Microsoft invested $45 million in the company, based in part on the work done in the area of a SAPI-compliant speech recognition system. The technology has since been integrated into Lernout & Hauspie's VoiceXpress™ product, which allows voice control of Microsoft and Corel Office software products. In 2001, Lernout & Hauspie encountered financial troubles. The company filed for bankruptcy, and its assets were purchased for $39.5 million by ScanSoft, a company known for its OmniPage optical character recognition (OCR) scanning software.
In recent years, speech recognition technology has advanced to the point where it is used by millions of individuals to automatically create documents from dictation. Medical transcriptionists listen to dictated recordings made by physicians and other health care professionals and transcribe them into medical reports, correspondence, and other administrative material. An increasingly popular method utilizes speech recognition technology, which electronically translates sound into text and creates drafts of reports. Reports are then formatted; edited for mistakes in translation, punctuation, or grammar; and checked for consistency and any possible medical errors. Transcriptionists working in areas with standardized terminology, such as radiology or pathology, are more likely to encounter speech recognition technology. Use of speech recognition technology will become more widespread as the technology becomes more sophisticated.
Court reporters typically create verbatim transcripts of speeches, conversations, legal proceedings, meetings, and other events when written accounts of spoken words are necessary for correspondence, records, or legal proof. Using the voice-writing method, a court reporter speaks directly into a voice silencer, a hand-held mask containing a microphone. Some voice writers produce a transcript in real time, using computer speech recognition technology. Speech recognition-enabled voice writers pursue not only court reporting careers, but also careers as closed captioners, CART reporters for hearing-impaired individuals, and Internet streaming text providers or caption providers.
Although an ever-increasing number of consumer speech recognition software programs are available, the most popular are Microsoft Windows®-based models that are used to dictate words into a word processor. In recent years, much progress has been made in computer processor speeds, voice recognition technology and database engine query retrieval rates.
The 2000 evaluation of conversational speech recognition over the telephone was part of an ongoing series of periodic evaluations conducted by NIST. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. It's clear from the base established in NIST's Conversational Speech Recognition Benchmark Tests that there was still lots of room for improvement, especially in foreign languages. In the 2000 tests, "best" Word Error Rates ranged from 19.3% for Switchboard data to 31.4% for CallHome data, and, for Mandarin, Character Error Rates were 57.1%.
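For readers unfamiliar with the metric, the sketch below shows one common way a word error rate is computed: the minimum number of word substitutions, deletions, and insertions needed to turn the recognizer's output into the reference transcript, divided by the number of reference words. It is an illustration only and is not NIST's scoring software; the function names and the sample sentences are assumptions made for this example, and real evaluations also apply text normalization rules that are omitted here. The same routine applied to individual characters gives a character error rate of the kind reported for Mandarin.

```python
from typing import Sequence


def error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Edit distance between two token sequences, divided by reference length.

    Hypothetical helper for illustration; not an official scoring tool.
    """
    if not reference:
        raise ValueError("reference must contain at least one token")
    # prev[j] holds the edit distance between the reference tokens seen
    # so far and the first j hypothesis tokens.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i] + [0] * len(hypothesis)
        for j, hyp_tok in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (0 if ref_tok == hyp_tok else 1)
            deletion = prev[j] + 1       # reference token missing from output
            insertion = curr[j - 1] + 1  # extra token in recognizer output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(reference)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Score on whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Score on individual characters, as is usual for Mandarin text."""
    return error_rate(list(reference), list(hypothesis))


if __name__ == "__main__":
    # Made-up sample: one substituted word and one dropped word out of six.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER = {wer:.1%}")
```

Because inserted words count as errors, a rate computed this way can exceed 100% when the recognizer outputs far more words than were spoken.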
In 1999 DARPA began working with private industry to develop a translator for military medical usage. Early versions of these devices were laptop-computer-sized. The civilian contractor, Marine Acoustics, came back to DARPA with the suggestion to make a hand-held, tactical version of the phraselator. The smaller phraselator was demonstrated and validated for use during the Victory Strike military exercise held in Poland on Sept. 10, 2001.
By early 2003 non-linguist US troops in Afghanistan and Iraq were able to communicate with local citizens by using a paperback-book-sized device called the phraselator. Co-developed by the Defense Advanced Research Projects Agency and private contractors, the phraselator uses computer chips to translate English phrases into as many as 30 foreign language equivalents. Users either speak into the device, which translates the English into the foreign-equivalent phrase, or they can punch a button to call up the desired phrase. The English-speaking operator can choose from a set of phrases ranging from just a few dozen to as many as 3,500, grouped by topics such as force protection, medical triage and medical first-response. The device was originally developed for military medical usage. Newer devices contain phraseology on refugee reunification and searches for weapons of mass destruction.
Project Babylon is a three-year DARPA program encompassing all military phraselator development. The goal is a two-way phraselator that can translate respondents' answers to users' queries. This two-way phraselator has been publicly demonstrated at an international linguist organization's annual meeting in Berlin, and to the US Senate Armed Services Committee.
As a corporate descendant of Bell Laboratories, Avaya Laboratories is a world leader in speech processing technology. More importantly, Avaya is a leader in the development of products that use this technology in a helpful, reliable manner. There are many examples within the Avaya product line, including the innovative speech recognition adjuncts available for Avaya Contact Center solutions. There are many ways to help employees and customers with speech recognition adjuncts for Avaya Unified Communication Center solutions. These solutions can provide voice access to a wide range of telephony and information management functions, including call control and the ability to manage e-mail, voice messages, calendars, task lists, and contacts. By 2003 more than 1 million businesses worldwide, including more than 90 percent of the FORTUNE 500®, relied on Avaya solutions and services.
Some voice response telephone systems use automatic speech recognition to let callers make appointments or place orders. In such a system, after the user speaks, the system repeats its understanding and gives the user an opportunity to verify whether it has recognized the utterance correctly. This may require several iterations to reach the correct result.
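The verify-and-retry behavior described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the design of any particular product: the function confirm_utterance, its two callback parameters, and the three-attempt limit are all hypothetical, and the recognizer and the caller's yes/no answer are replaced by scripted stand-ins so the example runs on its own.

```python
from typing import Callable, Optional


def confirm_utterance(
    recognize: Callable[[], str],
    confirm: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Ask, read back, and retry until the caller confirms.

    recognize: hypothetical callback returning the recognizer's best guess.
    confirm:   hypothetical callback that reads the guess back to the caller
               and returns True if the caller says it is correct.
    Returns the confirmed text, or None if every attempt was rejected.
    """
    for _ in range(max_attempts):
        heard = recognize()
        if confirm(heard):
            return heard
    return None  # give up, for example by handing off to a human operator


if __name__ == "__main__":
    # Scripted stand-ins: the first guess is wrong, the second is right.
    guesses = iter(["order two tickets", "order ten tickets"])
    result = confirm_utterance(
        recognize=lambda: next(guesses),
        confirm=lambda text: text == "order ten tickets",
    )
    print(result)
```

Capping the number of attempts is a design choice made for this sketch; without some limit, a caller whose speech is repeatedly misrecognized would never leave the loop.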
US telephone companies that provide directory assistance respond accurately to caller requests for telephone numbers 93% of the time, according to the December 2004 National Directory Assistance Performance Index, an independent analysis published semi-annually by The Paisley Group, Ltd. The Index is the only tool on the market that provides companies which offer directory assistance services with specific competitive intelligence to track and gauge their performance. Qwest Communications led in calls fulfilled with 95.7% successfully completed compared to the segment average of 93%.