Introduction to speech technologies.
Contact centers have been using speech technologies for several years now. Significant breakthroughs have been made in areas such as Speech Recognition, Speech Synthesis, and Voice Verification. These advances have resulted in more robust IVR designs and intuitive voice user interfaces that are easier to navigate. These technologies will be presented in more detail in a three-post series.
Automatic Speech Recognition basics.
Automatic Speech Recognition (ASR) is a technology that allows a computer to interpret human speech and convert it to text. Speech Recognition should not be confused with Speaker Recognition, which refers to identifying who the speaker is, as opposed to what the speaker said.
The ultimate goal of speech recognition is to convert speech to text with 100% accuracy, regardless of who the speakers are, how their accents vary, and which language is being used. Environmental conditions should not be a factor either (this relates mostly to noise and distortion issues). Despite significant breakthroughs in automatic speech recognition during the past few years, we are still quite far from this ideal. However, when some constraints are imposed on the recognition process, accuracy can be high enough to make its application feasible in many situations.
How speech recognition engines work.
Early attempts to implement speech recognition were based on template matching. The words to be recognized would be recorded, and a waveform would be produced for each of them. The uttered words would then be converted to similar waveforms and compared to the archived ones to produce a match. This method was rather inefficient in terms of both the required storage and the processing power needed to perform the matching (especially as the size of the available vocabulary increased). Template matching was also heavily speaker-dependent, and continuous speech recognition was impossible.
Newer engines are based on phoneme recognition. The utterances are split into small pieces (usually several pieces per second) and the respective waveforms are processed for each piece. The result is a set of frequency bands for each piece, which is then matched to a phoneme. By combining these phonemes, the engine tries to reconstruct the uttered words and phrases. The advantage of this approach is that the total number of phonemes a human can pronounce is very small (compared to the practically unlimited number of words and phrases). Phonemes are also common to everyone, across accents and even languages (though each language applies its own rules, the basic sounds are the same). The matching algorithm is also much faster than in the template method.
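The idea of frame-by-frame phoneme matching can be sketched in a few lines of Python. This is purely illustrative: the phoneme labels, feature vectors, and nearest-template matching below are made-up stand-ins for the acoustic models (such as hidden Markov models or neural networks) that real engines use.

```python
import math

# Hypothetical phoneme templates: phoneme -> "frequency band" feature vector.
# Real engines learn these from large amounts of training speech.
PHONEME_TEMPLATES = {
    "k": [0.9, 0.1, 0.2],
    "ae": [0.2, 0.8, 0.3],
    "t": [0.7, 0.2, 0.1],
}

def closest_phoneme(frame):
    """Return the phoneme whose template is nearest to this frame."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(PHONEME_TEMPLATES, key=lambda p: dist(PHONEME_TEMPLATES[p], frame))

def decode(frames):
    """Label each frame with a phoneme and collapse consecutive repeats."""
    phones = [closest_phoneme(f) for f in frames]
    return [p for i, p in enumerate(phones) if i == 0 or p != phones[i - 1]]

# Four fake audio frames; adjacent frames of the same sound collapse together.
frames = [[0.85, 0.15, 0.2], [0.25, 0.75, 0.3], [0.2, 0.8, 0.25], [0.7, 0.2, 0.1]]
print(decode(frames))  # ['k', 'ae', 't'] -> the word "cat"
```

Note how two adjacent frames both match "ae" and are collapsed into one phoneme; handling frame-to-phoneme alignment like this is one reason the approach scales to continuous speech where template matching could not.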
Phoneme recognition is still far from perfect. Each speaker's accent is only one of the issues that affect its success rate. Another important problem is ambiguity. Consider, for example, someone saying "forty two". They could mean "42" or "40 2". Or compare "eye" with "I". Ambiguities of the first type, concerning numbers, can be solved by introducing rules (constraints), for example that numbers must be spoken one digit at a time. However, this constraint causes problems when the IVR asks you to speak your 12-digit customer number. The second type of ambiguity can be addressed by taking the context of a phrase into account. All these inherent problems (and many others that go beyond the scope of this text) limit the accuracy of speech engines. However, with appropriate constraints, accuracy can still be well above 90% in many applications.
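The digit-at-a-time constraint can be illustrated with a small sketch. The function name and word list below are hypothetical; the point is that once the grammar only admits single-digit words, "four two" maps unambiguously to "42", while "forty two" simply falls outside the grammar and is rejected.

```python
# Map of the only words this constrained "grammar" accepts.
DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

def parse_digit_string(utterance):
    """Accept only digit-at-a-time utterances; return None otherwise."""
    digits = []
    for word in utterance.lower().split():
        if word not in DIGITS:
            return None  # out-of-grammar word -> recognition failure
        digits.append(DIGITS[word])
    return "".join(digits)

print(parse_digit_string("four two"))   # "42" -- unambiguous
print(parse_digit_string("forty two"))  # None ("forty" is not in the grammar)
```

This also shows the trade-off mentioned above: the constraint removes the ambiguity, but a caller reciting a long customer number digit by digit will find it tedious.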
The grammar.
A key concept in phoneme-based ASR engines is the grammar. A grammar is a set of rules written in a specific format (often XML-style) that defines all allowed utterances for a specific part of an application. Grammars contain information such as keywords that map to a specific meaning (within the context of the part of the application they apply to) and other rules that help resolve the aforementioned ambiguities.
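As an illustration of the XML style mentioned above, here is a minimal grammar in the W3C SRGS format that a yes/no prompt might use; the rule name and word choices are just examples:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" root="answer">
  <!-- The caller may say exactly one of the listed items. -->
  <rule id="answer">
    <one-of>
      <item>yes</item>
      <item>yeah</item>
      <item>sure</item>
      <item>no</item>
      <item>nope</item>
    </one-of>
  </rule>
</grammar>
```

Because only these five words are allowed at this point in the dialog, the engine's search space shrinks dramatically, which is exactly how grammars raise recognition accuracy.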
The recognition process.
The recognition process involves several steps. When the user speaks to an ASR application, the first step is to detect when the utterance starts and stops. This avoids overburdening the engine with analyzing silence and/or noise. The utterance is then analyzed and matched against the allowed phrases in the grammar that applies to that part of the application. Typically, a result is returned along with a recognition probability. As long as the recognition probability is higher than a predefined threshold, the recognition is considered a success.
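The final accept/reject step can be sketched as follows. The function name, threshold value, and tuple result are illustrative assumptions, not any particular engine's API; the point is simply that the application, not the engine, decides what confidence level counts as success.

```python
# Illustrative threshold; real applications tune this per prompt.
CONFIDENCE_THRESHOLD = 0.6

def handle_result(hypothesis, confidence):
    """Accept the recognition only if confidence clears the threshold."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return ("accepted", hypothesis)
    return ("rejected", None)  # typically triggers a re-prompt

print(handle_result("main menu", 0.87))  # ('accepted', 'main menu')
print(handle_result("main menu", 0.42))  # ('rejected', None)
```

Tuning this threshold trades off false accepts (acting on a misrecognition) against false rejects (re-prompting a caller who spoke correctly).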
