Friday, August 12, 2011

DTMF vs Speech enabled IVR: an in-depth examination - part 2

Transition from DTMF to Speech and issues that arise

In the previous post we briefly discussed the differences between DTMF-enabled and speech-enabled IVR systems. We will now turn to the transition from a DTMF-based IVR application to a similar application powered by a speech recognition engine, and to the considerations that have to be taken into account along the way.

Speech-powered applications have to be complex enough to justify the investment. It is neither cost effective nor really more efficient to spend large amounts of money on a two-layer application with 3-4 options on the first menu and 2-3 options on each submenu. Such applications can be implemented very nicely with DTMF, and the user gets served quickly enough. Thus, if your self-service application is simple and small, it is currently best to use DTMF.

For complicated applications, though, speech recognition is vastly superior in terms of efficiency and quality. This holds even when the recognition accuracy achieved by the speech recognition engine is below 50%! The reason is that the user of an automated system wants, above all, to minimize the interaction time. A typical user will prefer entering the same information twice or even three times and being done in 1 minute total, rather than navigating menus and listening to irrelevant information for 2 minutes before being able to enter the information quickly and accurately once.

The following example uses an application flow tree (created at random for this purpose) that is complex enough to showcase the difference in implementation logic between DTMF and speech.

In the tree appearing to the right, the leaves are the final services the application offers. Information retrieval and announcement services are highlighted in yellow, and services in which the customer has to enter information are highlighted in orange.

In a DTMF-powered application, each menu has to be presented hierarchically. Users first listen to options 1-5 of the start menu; after they select a submenu, they are presented with all the options below it, then they move on to the next submenu, and so on. Typically the user can navigate back to the previous menu or to the start menu by using the * and # keys or some number.

A speech-enabled application, on the other hand, allows the user to jump directly to any sub-tree they wish, or to access a service (a leaf of the tree) directly. At any point during navigation the user may also jump to any other service with one action, without having to pass through the hierarchy. Traversing the tree this way requires, of course, that the user knows the available options; otherwise the options have to be presented again in a hierarchical manner. Once the user has tried the application a few times, though, service times can be greatly reduced.
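The difference in navigation logic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `MENU` tree and its service names are invented (they are not the tree from the figure), and the speech recognizer is assumed to have already turned the caller's utterance into a text phrase.

```python
# Hypothetical menu tree: inner nodes are dicts keyed by DTMF digit,
# leaves are service names. All names are invented for illustration.
MENU = {
    "1": {
        "1": "balance",
        "2": "last bill",
        "3": {"1": "usage today", "2": "usage this month", "3": "roaming usage"},
    },
    "5": {"1": {"1": {"1": "report fault", "2": "book technician"}}},
}


def dtmf_navigate(tree, keys):
    """DTMF style: descend one menu per key press, starting at the root."""
    node = tree
    for key in keys:
        node = node[key]
    return node


def build_speech_index(tree, path=()):
    """Flatten the tree so that every leaf maps to its full menu path."""
    index = {}
    for key, child in tree.items():
        child_path = path + (key,)
        if isinstance(child, dict):
            index.update(build_speech_index(child, child_path))
        else:
            index[child] = child_path
    return index


SPEECH_INDEX = build_speech_index(MENU)


def speech_navigate(phrase):
    """Speech style: one recognized phrase selects a leaf directly.

    Returns the menu path of the service, or None if the phrase is unknown.
    """
    return SPEECH_INDEX.get(phrase.strip().lower())


print(dtmf_navigate(MENU, "133"))          # three key presses
print(speech_navigate("Book technician"))  # one utterance
```

In the DTMF version the caller's position in the tree determines what can be selected next, while in the speech version every leaf is reachable from anywhere with a single lookup.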

Let’s assume that a caller wants to perform actions 1.3.3 and 5.1.1.2. In the DTMF-style application they would have to go through 3 menus for the first item, jump back to the start, and then go through 4 more menus to reach the second item. That is a minimum of 8 steps. This procedure will never improve, no matter how experienced the user is (save for the time spent listening to prompts, which can be eliminated via barge-in). In a speech-enabled application, though, an experienced user can jump directly from the initial menu to the first item and from there directly to the second item, without even having to return to the start menu. In this case we achieve the same result with 2 steps. With a 50% average recognition success rate, each request has to be spoken twice on average, which gives about 4 attempts in total. So, for this particular example, the experienced user is still served roughly twice as fast as with the 100% accurate DTMF.
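The arithmetic behind this comparison can be written out as a short Python sketch. It rests on simplifying assumptions that are not part of the original example: every step (key press or utterance) takes the same amount of time, and a failed recognition is simply retried with the same, independent success probability, so the expected number of attempts per utterance is 1 / rate. The function names are hypothetical.

```python
def dtmf_steps(targets):
    """Steps needed to reach each target in turn with DTMF.

    One step per menu level of each target, plus one jump back to the
    start menu between consecutive targets.
    """
    menu_steps = sum(len(target.split(".")) for target in targets)
    returns_to_start = len(targets) - 1
    return menu_steps + returns_to_start


def expected_speech_attempts(targets, recognition_rate):
    """Expected utterances needed when each one succeeds with the given rate.

    Assumes independent retries, so each target costs 1 / rate attempts
    on average.
    """
    return len(targets) / recognition_rate


targets = ["1.3.3", "5.1.1.2"]
print(dtmf_steps(targets))                     # 8
print(expected_speech_attempts(targets, 0.5))  # 4.0
```

Under these assumptions the break-even point for this example is a recognition rate of 25%, at which the two utterances would also take 8 attempts on average.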

The example above shows quite clearly the advantages speech recognition can bring to advanced IVR users. However, inexperienced users who interact with the system for the first time will typically spend more time learning how to work with it. This is part of the learning process inherent in any new technology being rolled out to the general public, and it typically takes some time before the new system becomes more efficient than the old one for the average user.


Tuesday, August 9, 2011

DTMF vs Speech enabled IVR: an in-depth examination

I have recently been heavily involved in a rather large-scale deployment of a new customer care IVR system which is gradually replacing an old DTMF-based system. Based on that experience I would like to elaborate a bit on the differences between DTMF-based and speech-recognition-based systems and highlight some concerns that have come up while deploying the speech-enabled IVR. Since there is a lot going on in such systems to make them work, this text will span more than one post.

The system we worked on is an Avaya Media Processing Server (MPS) IVR (a product line Avaya recently acquired from Nortel) powered by the Nuance speech recognition engine. However, the concepts discussed below should apply to a large degree to any platform.

The characteristics of DTMF

Let’s start with DTMF and its characteristics. DTMF is a powerful way of entering and transmitting information over the telephone that has been with us for many decades. It has a lot of advantages that made it prevalent in IVR systems up until very recently, the two most important being the following:


  • DTMF is very simple to implement. It reuses the same technology as classic telephony and is very simple to integrate into branching logic on the IVR platform. It is also very quick to process, since it relies on the telephony infrastructure already in place.
  • It is very accurate. If the user pays even a little attention, accuracy can easily approach 100%, regardless of external conditions such as noise.


Despite these advantages, DTMF also comes with a number of drawbacks that really limit its potential:

  • DTMF is limited in capacity. You can enter only as many distinct tones as there are buttons on the phone keypad. While this is sufficient for entering numbers, either as data or as options for branching logic, DTMF cannot be used to enter more complicated and detailed pieces of information. This severely reduces the capabilities and services that can be deployed.
  • Excessive use of branching logic in large menus with many submenus makes the IVR application extremely clunky. Menus that contain more than 3-4 different options are too cumbersome for the user. Navigation is also time-consuming: to reach a service that lies in the third layer of an application, for example, the user has to go through three menus with several options each. This can result in slow service times.
  • Finally, DTMF requires the user to have their hands free to push the buttons. This is not very helpful when you are on the move or otherwise engaged.

The characteristics of Speech Recognition


On the other side of the table, there is speech recognition. It is a relatively old technology that has only recently reached maturity. Speech recognition systems effectively solve the major problems DTMF suffers from: they allow for very complex input (not just numbers), they offer vastly superior navigation options, and they can be used without involving one’s hands. Furthermore, speech-enabled systems feel more natural to interact with. On the other hand, their accuracy is usually a lot less than 100%, their performance can be severely affected by noise, and they are currently a lot more expensive and complicated to deploy than DTMF.

While at first it seems that the pros and cons of the two options are more or less an equal tradeoff, the truth is that speech recognition systems are far superior, assuming a minimum application complexity. The main reason is that most of the drawbacks they come with (which are the strengths of DTMF) simply don’t matter enough! This is a very bold statement that will be discussed extensively, with examples, in a follow-up post.