Thursday, February 3, 2011

Speech Technology: Text to Speech

Introduction to Speech Synthesis or Text-to-Speech.

This is the second part of a 3-post series discussing the basic speech technologies that can be used in contact centers. The first part presented Automatic Speech Recognition (ASR), a technology that allows a computer to interpret human speech and convert it to text. The opposite procedure, producing human speech artificially from a piece of text, is called Speech Synthesis or Text-to-Speech (TTS).

How Speech Synthesis works.

Synthesizing speech from a piece of text generally works in three steps. First, the text is converted to a normalized form that consists only of words (expanding abbreviations, numbers, etc.). Second, the engine assigns a phonetic transcription to each word, using a phonetic alphabet. Finally, the synthesizer uses the phonetic transcript to produce the actual sound.
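The first two steps can be sketched in a few lines of Python. This is a toy illustration only: real engines use large pronunciation lexicons and statistical grapheme-to-phoneme models, and the abbreviation and phoneme tables below are made-up examples.

```python
# Illustrative tables -- a real engine's lexicon holds tens of thousands of entries.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
PHONEMES = {"call": "K AO L", "doctor": "D AA K T ER", "smith": "S M IH TH"}

def normalize(text: str) -> list[str]:
    """Step 1: reduce text to plain words (expand abbreviations, spell out digits)."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif token.isdigit():
            words.extend(DIGITS[int(d)] for d in token)
        else:
            words.append(token.strip(".,!?"))
    return words

def transcribe(words: list[str]) -> list[str]:
    """Step 2: look up a phonetic transcription for each word; unknown
    words are passed through marked, so a fallback model could handle them."""
    return [PHONEMES.get(w, f"<oov:{w}>") for w in words]
```

For example, `normalize("Call Dr. Smith at 42")` yields `["call", "doctor", "smith", "at", "four", "two"]`, which `transcribe` then maps to phoneme strings for the final waveform-generation step.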

Speech synthesis is primarily done by concatenating segments of speech recorded by actual humans. The produced waveform consists mostly of real human speech and is only adjusted by processing at the points where segments are joined. This type of synthesis is the most natural-sounding, producing speech that closely resembles a human voice. Larger databases of prerecorded segments greatly improve the quality of the output, but also increase processing power and memory requirements. Another frequently used technique, with substantially lower computing requirements, is additive synthesis. This method does not use actual human speech samples but is based instead on mathematical models. Speech produced this way sounds robotic; however, it is easier to produce and more uniform than concatenated speech, which might produce glitches at the points of concatenation. Some engines use a combination of both methods.
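Both ideas can be sketched at the sample level. The snippet below is a minimal, hedged illustration, not how production engines are implemented: `crossfade_concat` shows the kind of smoothing applied at segment joints in concatenative synthesis, and `additive_tone` builds a waveform purely from a mathematical model (a sum of sinusoids), which is why such output tends to sound robotic.

```python
import math

SAMPLE_RATE = 8000  # telephone-quality sampling rate, in Hz

def crossfade_concat(a: list[float], b: list[float], overlap: int) -> list[float]:
    """Concatenative idea: join two recorded segments, linearly
    crossfading over `overlap` samples to smooth the joint."""
    out = list(a[:-overlap])
    for i in range(overlap):
        w = i / overlap  # fade weight ramps from 0 to ~1
        out.append(a[len(a) - overlap + i] * (1 - w) + b[i] * w)
    out.extend(b[overlap:])
    return out

def additive_tone(f0: float, harmonics: list[tuple[int, float]],
                  duration: float) -> list[float]:
    """Additive idea: build a waveform as a sum of sinusoidal harmonics
    of the fundamental frequency f0, with given relative amplitudes."""
    n_samples = int(SAMPLE_RATE * duration)
    return [
        sum(amp * math.sin(2 * math.pi * f0 * k * t / SAMPLE_RATE)
            for k, amp in harmonics)
        for t in range(n_samples)
    ]

# A 120 Hz "voice" with two harmonics, 50 ms long.
wave = additive_tone(120.0, [(1, 1.0), (2, 0.5)], 0.05)
```

The crossfade is exactly the "adjustment at the points of segment concatenation" mentioned above; without it, an abrupt amplitude jump at the joint is audible as a glitch.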

Quality of synthesized speech.

The quality of synthesized speech has improved greatly over the past few years. Large amounts of money have been invested to fine-tune synthesizers for languages such as English, which apply to very broad markets. Engines produced by market leaders such as Nuance, Loquendo and Acapela can often be indistinguishable from actual human speech, especially over the phone (these companies, and many others, offer free demos of their engines - check them out on their websites). TTS engines are available for various languages; however, their quality usually lags somewhat behind English (though it is still very good).

TTS and ASR usage in IVR.

Synthesized speech is very convenient to use in IVR systems when a large number of prompts must be changed frequently, providing large cost savings as well as deployment efficiency (synthesized speech can be produced instantly, without any human intervention). Its quality still does not match prerecorded prompts, but for some languages it comes close. Thus, many companies opt to use TTS in their voice applications, especially for English-language applications. TTS can be combined with ASR for a complete cycle, as the schematic shows.
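A single turn of that TTS-ASR cycle might look like the sketch below. All names here are hypothetical: `speak` stands in for a real TTS engine call and is injected as a function, so the dialog logic stays engine-agnostic, and the menu entries are invented examples of dynamically generated prompts that would be costly to keep prerecording.

```python
# Hypothetical IVR menu: recognized caller utterance -> prompt to synthesize.
MENU = {
    "balance": "Your balance is 120 euros.",
    "agent": "Please hold while I transfer you to an agent.",
}
FALLBACK = "Sorry, I did not catch that. Please repeat."

def ivr_turn(recognized_text: str, speak, menu=MENU) -> str:
    """One dialog turn: map the ASR result to a prompt and hand it to TTS.

    `recognized_text` is assumed to come from the ASR engine described in
    part one; `speak` would wrap the TTS engine's synthesis call.
    """
    prompt = menu.get(recognized_text, FALLBACK)
    speak(prompt)  # in production: tts_engine.synthesize(prompt) -> audio out
    return prompt

# Demo with a stub "engine" that just records what would be spoken.
spoken = []
ivr_turn("balance", spoken.append)
```

Because the prompt is plain text until the moment it is spoken, changing the balance figure or adding a menu branch requires no studio recording session, which is the cost advantage described above.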



2 comments:

  1. There is a Greek company (actually a spin-off) that specialises in TTS technology. I have heard some of their samples for the Greek language and I must say they are highly realistic.
    http://www.innoetics.com/

  2. I have heard their demos, they are indeed very good at Greek-language TTS and comparable to (if not better than) the large global TTS vendors. The problem with languages such as Greek is that they apply to a relatively limited market, which naturally makes the R&D budgets smaller compared to English, for example.
