This article was originally published in the August 1993 issue of EXE, a UK magazine for software developers.

Phones and Tones

Telephone technology is undergoing its own computer revolution. John Nixon explains how to build systems that answer phone calls under software control.

Imagine what you could do if your computer answered your telephone. You could greet the caller with a musical jingle. You could give them a list of options, e.g. 'Press 1 to speak to sales, press 2 to request a brochure, press 3 to receive a faxed map of how to get here'. If you knew the telephone number they were calling from, you could prioritise the call: urgent calls could be transferred to a mobile phone, people selling you life insurance could be told where to go, calls from fax machines could be transferred to your fax modem, and other calls could be transferred to an operator or asked to leave a message.

Well, you don't have to imagine, because you can build systems like this today. The technology is called voice processing. In the United States they have been using it for ages. Now, with the hardware very affordable and the regulatory regime opening up, it is taking off in the rest of the world. The market in the UK seems to be growing at about 100% per annum. Not surprisingly, there is a massive shortage of software engineers with experience in this area.

Applications

The traditional applications are Audiotex, Voicemail and Interactive Voice Response. Audiotex is answering a call and playing a message. Typically the call is premium rate, like an 0898 number. Voicemail is basically a souped-up answering machine which can handle loads of users and features. Interactive Voice Response (IVR) is where the caller interfaces with another IT system. The classic example is a home banking service which allows you to query account details or authorise transactions. There are also loads of niche markets with applications limited only by your imagination.

The hottest new source of applications is the combination of voice with fax. This creates a multimedia combination where the audio (voice) complements the video (fax). You end up with the best of both worlds: the immediacy and accessibility of voice with the permanency and detail of fax. The best-known fax processing application is the fax document retrieval system, as used, for example, by Intel and Quarterdeck technical support.

BT and Mercury aren't currently allowed to tell the answering party the telephone number of the caller. When they are (this could be by the end of the year) it will open up a number of interesting applications. For example, you could build a sales desk or help desk support system which brings up the customer details on screen even before the call is answered. This is an example of what is called CTI (Computer Telephone Integration), which is experiencing particularly rapid growth.

Development

To develop voice systems you will need a suitably equipped PC. The key piece of hardware is the telephone interface card which, of course, needs to be BABT approved. These cards are far more than just audio codecs: they have to be able to recognise tones and speech, which requires an onboard DSP. The manufacturers which have products available for the UK are Rhetorex, Dialogic, Natural Microsystems and Staria. Aculab have an ISDN interface card which can terminate 30 telephone lines in one PC.

Software development tools split into three distinct types: application generators, script languages and C libraries. Application generators and script languages enable people without programming experience to develop simple audiotex or IVR applications. They are very popular with beginners because they ease the learning curve. However, like all such tools, development with them is slower and is limited to the types of application the tool was designed for. More complex applications such as voicemail systems, complex IVR or CTI applications, or anything out of the ordinary, would be better off implemented in C. The remainder of this article shows you how.

System Design

The most popular operating system for building voice systems is DOS. All of the major voice hardware manufacturers also support OS/2 and UNIX. Windows NT promises to be an excellent platform for voice systems. However, plain Windows 3.1 is very awkward for voice processing systems, which demand real-time support. It is currently necessary to service the device driver at least every 200ms, which Windows struggles to deliver. It is possible to overcome some of the real-time problems using some DDK programming, but it is a lot of work and I know of some other developers who have given up trying.

One approach which seems to have a lot of potential is to put all the real-time stuff in one PC and put the graphical user interface somewhere else on the network. You could also separate the call processing logic from the voice/fax servers. You end up with a client/server architecture that is well adapted to CTI applications where much of the data is already held on a corporate mainframe.

Drivers

All of the hardware manufacturers provide device drivers for DOS, mostly in the form of a TSR program with an interrupt-driven API. The driver either accesses hardware registers or passes a message to DSP firmware on the card. The better hardware products have downloadable firmware, which enables the manufacturers to keep adding to and improving the card's functionality.

All of the API functions return almost at once. Many of them start an operation (e.g. playing back an audio file) which will take a while to complete. In those cases, when the operation finishes, the driver enqueues an event. Your software needs to wait for that event before doing anything else on that line. This sort of API is really too low-level for general use (as well as being hardware specific) and you would be well advised to build a layer of software on top.
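The start-then-wait pattern can be sketched in C roughly as follows. This is only a simulation under stated assumptions, not any manufacturer's API: the names drv_start_play, drv_get_event and play_and_wait and the event codes are hypothetical, and a small in-memory queue stands in for the TSR driver.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event codes; a real driver defines its own. */
enum { EV_NONE = 0, EV_PLAY_DONE = 1, EV_HANGUP = 2 };

typedef struct { int line; int code; } event_t;

#define QUEUE_SIZE 16
static event_t queue[QUEUE_SIZE];
static size_t q_head = 0, q_tail = 0;

/* Simulated driver side: enqueue an event when an operation finishes. */
int drv_post_event(int line, int code)
{
    size_t next = (q_tail + 1) % QUEUE_SIZE;
    if (next == q_head)
        return 0;                       /* queue full */
    queue[q_tail].line = line;
    queue[q_tail].code = code;
    q_tail = next;
    return 1;
}

/* Simulated driver call: fetch the next event; returns 0 if none waiting. */
int drv_get_event(event_t *ev)
{
    assert(ev != NULL);
    if (q_head == q_tail)
        return 0;
    *ev = queue[q_head];
    q_head = (q_head + 1) % QUEUE_SIZE;
    return 1;
}

/* Simulated driver call: start playback and return at once.
   In this simulation the playback "finishes" immediately. */
int drv_start_play(int line, const char *file)
{
    (void)file;
    return drv_post_event(line, EV_PLAY_DONE);
}

/* The layer on top: start the operation, then wait for its event.
   Single-line sketch: events for other lines are simply discarded,
   and max_polls bounds the wait so the loop cannot spin forever. */
int play_and_wait(int line, const char *file, int max_polls)
{
    event_t ev;
    if (!drv_start_play(line, file))
        return EV_NONE;
    while (max_polls-- > 0) {
        if (drv_get_event(&ev) && ev.line == line)
            return ev.code;
    }
    return EV_NONE;
}
```

A handler would call play_and_wait(1, "HELLO.VRP", 10) and check whether the result was a normal completion or a hangup before moving on.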

Multitasking

The first major hurdle you are likely to encounter is how to keep track of more than one telephone call at once. You could hold the details about what is happening on each channel in a set of arrays and have a routine which goes around the channels deciding what to do next. Your application becomes a hugely complex state machine. This is inelegant, hard to debug and makes code reuse very awkward. Nevertheless some programmers have built systems this way.
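To make the drawback concrete, here is a minimal sketch of the array-plus-state-machine approach for a trivial call flow. The states, inputs and the channel_step function are all invented for illustration; a real application would need many more of each, which is exactly where the complexity comes from.

```c
#include <assert.h>

#define NUM_CHANNELS 4

/* Hypothetical call states and inputs for a very simple call flow. */
typedef enum { ST_IDLE, ST_GREETING, ST_MENU, ST_GOODBYE } state_t;
typedef enum { IN_RING, IN_PLAY_DONE, IN_DIGIT, IN_CALLER_HUNG_UP } input_t;

/* One entry per channel; static storage starts every channel at ST_IDLE. */
static state_t channel_state[NUM_CHANNELS];

/* Advance one channel by one input and return its new state.
   Every new feature adds states and transitions to this one function. */
state_t channel_step(int ch, input_t in)
{
    state_t s;
    assert(ch >= 0 && ch < NUM_CHANNELS);
    s = channel_state[ch];

    if (in == IN_CALLER_HUNG_UP) {
        s = ST_IDLE;                    /* a hangup resets from any state */
    } else {
        switch (s) {
        case ST_IDLE:     if (in == IN_RING)      s = ST_GREETING; break;
        case ST_GREETING: if (in == IN_PLAY_DONE) s = ST_MENU;     break;
        case ST_MENU:     if (in == IN_DIGIT)     s = ST_GOODBYE;  break;
        case ST_GOODBYE:  if (in == IN_PLAY_DONE) s = ST_IDLE;     break;
        }
    }
    channel_state[ch] = s;
    return s;
}
```

Note that the progress of each call lives entirely in channel_state rather than in the flow of the code, so a subroutine such as a reusable menu cannot simply be called; it has to be flattened into extra states.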

The most natural way of handling a single call is just plain sequential code: answer the call, play a message, call a menu subroutine, play another message, hang-up etc. The stack frame and the instruction pointer determine the status of the call. Thus for multiple calls you need multiple stacks and instruction pointers. It is precisely for applications like this that multi-tasking was developed. This is the way most voice systems are built.

The most popular multi-tasker used for voice systems is CTASK which is a piece of public domain software developed by Thomas Wagner. DESQview is probably the next most commonly used, mainly because it can also multi-task whole applications. This enables part of the system (e.g. a graphical user interface or a fax server) to be developed as a separate application without having to worry about multi-tasking or memory constraints.

The type of multi-tasking most appropriate for voice systems is thread-level multi-tasking. All the tasks share the same code segments and default global data segments but have their own stack and register context. These lightweight threads can be scheduled very efficiently, which is important for a real-time system with dozens of processes. The downside is that global resources such as the memory allocation subsystem (i.e. malloc and free) need to be protected with semaphores to stop tasks interfering with each other if a context switch should occur in the middle of manipulating global data structures such as the heap.

Apart from giving each telephone line its own task, it is convenient to create one more task to dispatch the voice hardware event queue. This task is usually the only one which has to poll the driver, thus increasing the system efficiency.
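One pass of such a dispatcher might look something like the sketch below. Everything here is an assumption made for illustration: the event_t layout, the per-line mailboxes, and the fake_poll function, which stands in for whatever get-event call the real driver offers. In a real multi-tasker the dispatcher would also wake the task blocked on the mailbox after delivering to it.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_LINES    4
#define MAILBOX_SIZE 8

typedef struct { int line; int code; } event_t;

/* One small mailbox per telephone line, read by that line's task. */
typedef struct {
    int codes[MAILBOX_SIZE];
    int count;
} mailbox_t;

static mailbox_t mailbox[NUM_LINES];

/* Deliver one event to its line's mailbox.
   Returns 1 on success, 0 if the line is invalid or the mailbox is full. */
int deliver(const event_t *ev)
{
    mailbox_t *mb;
    assert(ev != NULL);
    if (ev->line < 0 || ev->line >= NUM_LINES)
        return 0;
    mb = &mailbox[ev->line];
    if (mb->count == MAILBOX_SIZE)
        return 0;
    mb->codes[mb->count++] = ev->code;
    return 1;
}

/* One pass of the dispatcher task: drain the driver's queue.
   `poll` is a hypothetical wrapper around the driver's get-event call;
   it returns 0 once the queue is empty. */
int dispatch_pending(int (*poll)(event_t *ev))
{
    event_t ev;
    int delivered = 0;
    while (poll(&ev))
        delivered += deliver(&ev);
    return delivered;
}

/* Simulated driver queue holding three pending events. */
static const event_t pending[] = { {0, 1}, {2, 1}, {0, 2} };
static size_t next_pending = 0;

int fake_poll(event_t *ev)
{
    if (next_pending >= sizeof pending / sizeof pending[0])
        return 0;
    *ev = pending[next_pending++];
    return 1;
}
```

Because only dispatch_pending ever touches the driver, the channel tasks can simply block on their own mailbox instead of each polling in turn.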

The program then takes the form of a number of channel handlers which contain the logic. The main task assigns handlers to telephone lines and then blocks waiting for a signal to shut down the system. See Figure 1 for an example 'Hello World!' application.

Figure 1 - Source code for Hello World! program

#include "TELEVOX.H"

void handler(void)
{
  // Answer phone call
  vox_answer();

  // Play "Hello World!"
  vox_play("HELLOWLD.VRP",0);

  // Hangup phone call
  vox_hangup();
}

void main(void)
{
  // Initialise the voice engine
  vox_init();

  // Assign handler for voice channel 1
  VOX_SET(1,VOX_SET_DIRECT,handler);

  // Start answering phone calls
  vox_start();

  // Wait for any key or end of handler
  vox_wait_key();

  // Stop all phone calls and shutdown
  vox_stop();
}

Telephones

Another major area in which voice systems present a learning curve is integration with other equipment. For example, there is a large number of different types of PBX (telephone switchboard) systems. The latest digital telephone interfaces (ISDN) are much cheaper for big systems but require a dedicated co-processor just to handle the messaging protocol to the telephone exchange.

The biggest problem of all, though, is with the telephone at the other end of the line. Most voice applications need to give the user options to choose from. The easiest way to do this is to ask the caller to press a key on their telephone keypad. The problem is that not all telephones are the same, and what happens when the user presses a key depends upon the telephone and the exchanges in between.

Old-fashioned telephones dial the digits 0 to 9 by making and breaking the subscriber loop that many times in quick succession (ten times for 0). This is called rotary or pulse dialling. If used during a telephone call, the far end hears a series of clicks. These are hard to distinguish from noise even against a quiet background. The firmware used by Rhetorex (who have led the industry in pioneering new technology and features) can detect digits 3 to 0 but not digits 1 and 2. Worse than that, British Telecom's System X exchanges have a bug which confuses a pulse digit 1 for a Timed Break Recall (the R button on many telephones). This has the effect of interrupting the call and giving out a second dialtone (used for making three-way calls). When the user then hangs up, the exchange rings them back and connects them to the original call. This scenario is, of course, a nightmare for the user. It can be avoided by checking the type of phone first by, for example, asking the user to press a safe digit like 0.
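The pulse counting rule described above (one loop break per unit, ten for 0) is simple to express in C. The sketch below also encodes the limitation quoted for the Rhetorex firmware, that only digits 3 to 0 are detectable in mid-call; the function names are my own and not part of any library.

```c
#include <assert.h>

/* Convert a count of loop-disconnect pulses to the dialled digit.
   One to nine pulses mean digits 1 to 9; ten pulses mean 0.
   Returns -1 for a count that is not a valid digit. */
int pulse_count_to_digit(int pulses)
{
    if (pulses < 1 || pulses > 10)
        return -1;
    return pulses % 10;
}

/* The reverse mapping: how many pulses a phone sends for a digit.
   Returns -1 if the argument is not a digit 0 to 9. */
int digit_to_pulse_count(int digit)
{
    if (digit < 0 || digit > 9)
        return -1;
    return digit == 0 ? 10 : digit;
}

/* Per the limitation described in the text, digits 1 and 2 (one or two
   pulses) cannot be told apart from line noise, so a menu aimed at
   pulse phones should only offer digits 3 to 9 and 0. */
int pulse_digit_is_detectable(int digit)
{
    int pulses = digit_to_pulse_count(digit);
    return pulses >= 3;
}
```

This is why 0 makes a good 'safe digit' for the phone type check: at ten pulses it is the easiest pulse digit to detect and it cannot be mistaken for a recall.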

Most business phones and many residential phones support a much better signalling standard called DTMF, popularly known as tone dialling. Quite a number of people could change their phone systems to tones just by flicking a switch on the side. If a key is pressed during a connected telephone call, the telephone still produces the tone and the far end can hear it and determine the key pressed. DTMF (Dual Tone Multi Frequency) uses pairs of audible frequencies to indicate the key pressed. There are sixteen DTMF tones defined: 0-9,*,#,A,B,C,D although A-D only seem to be used by the military. The key marked * is almost universally called star. The key marked # has several names including hash and pound sign (in America). British Telecom use the name 'square' in their recorded announcements, which seems like a sensible choice, given the way they mark the symbol on their telephones.
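The 'pairs of frequencies' scheme is easiest to see as a four-by-four grid: each key sends one low tone for its row and one high tone for its column. The table below uses the standard DTMF frequencies; the lookup function dtmf_frequencies is just an illustrative helper of my own, not something a voice card library necessarily provides.

```c
#include <assert.h>
#include <string.h>

/* Standard DTMF frequencies in Hz: one low (row) tone, one high (column) tone. */
static const int dtmf_row_hz[4] = { 697, 770, 852, 941 };
static const int dtmf_col_hz[4] = { 1209, 1336, 1477, 1633 };

/* Keypad laid out row by row; A to D occupy the fourth column. */
static const char dtmf_keys[] = "123A456B789C*0#D";

/* Look up the tone pair for a key.
   Returns 1 and fills in both frequencies, or 0 if the key is not a DTMF key. */
int dtmf_frequencies(char key, int *low_hz, int *high_hz)
{
    const char *p;
    int index;

    assert(low_hz != NULL && high_hz != NULL);
    if (key == '\0')
        return 0;                       /* strchr would match the terminator */
    p = strchr(dtmf_keys, key);
    if (p == NULL)
        return 0;
    index = (int)(p - dtmf_keys);
    *low_hz  = dtmf_row_hz[index / 4];
    *high_hz = dtmf_col_hz[index % 4];
    return 1;
}
```

For example, the 5 key sends 770 Hz and 1336 Hz together. Detection on the card works the other way round: the DSP measures which row tone and which column tone are present and looks up the key.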

You can avoid digits altogether and go for voice recognition. Recognising speech over the telephone is harder than usual because the audio path filters out the higher frequencies which contain a lot of useful information. It is also necessary to be able to recognise the voice of anyone who calls without having heard their voice before. This is called Speaker Independent Voice Recognition. The state-of-the-art in this technology is changing rapidly but you should not expect good results without a lot of development effort and a substantial investment. Most users would prefer speech recognition if it were foolproof but, since it currently isn't, they prefer tones. National Westminster Bank, with their excellent home banking service called Actionline, give the caller a choice of methods which seems to me to be the best of both worlds.

Yet another option is to use what is rather colourfully called in the trade 'grunt detection'. This asks for a response using the following dialogue: 'If you want details on ABC, say yes, otherwise remain silent'. The system then pauses and listens to the line. If the caller says anything at all (even no) it is taken as yes. This is easy to implement in the DSP firmware just by summing up the audio energy in the digitised signal.
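The energy summing idea can be sketched in a few lines of C. This assumes 16-bit linear samples and a threshold that you would have to tune by experiment on real lines; on a real card the equivalent calculation runs in the DSP firmware rather than on the PC, and the function names here are invented.

```c
#include <assert.h>
#include <stddef.h>

/* Mean energy of a block of 16-bit linear samples: the average of the
   squared sample values. Returns 0.0 for an empty block. */
double mean_energy(const short *samples, size_t n)
{
    double sum = 0.0;
    size_t i;

    assert(samples != NULL || n == 0);
    for (i = 0; i < n; i++)
        sum += (double)samples[i] * (double)samples[i];
    return n > 0 ? sum / (double)n : 0.0;
}

/* "Grunt detection": anything louder than the threshold counts as yes.
   The threshold is an assumption to be tuned against real line noise. */
int caller_said_something(const short *samples, size_t n, double threshold)
{
    return mean_energy(samples, n) > threshold;
}
```

Averaging over the block rather than using the raw sum keeps the threshold independent of how long the listening window is.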

Voice Dialogues

Voice systems are probably judged most by the quality of their Voice User Interface. A good system is a real joy to use, but a bad one is a complete PITA. A lot of the development effort needs to go into making the dialogue with the user as smooth as possible.

For the very highest audio quality, you should find a professional voiceover artist (called a voice talent) and record in a studio. You should digitise at a high sample rate (e.g. 48kHz) with a true 16-bit linear sample. This can be digitally mixed, filtered and post-processed to the best format which the hardware can actually play.

A musical cue is an effective way of breaking the news to your listener that they aren't actually talking to the human they expected. Sensible use of clip-audio can make your system much more pleasurable to use.

Your voice script needs to take charge of the conversation because if the caller makes decisions about when to speak your system won't be able to understand. The best way to avoid the user speaking in the gaps is to make the gaps shorter. Speaking fast (but clearly) gives a much more professional image than speaking slowly anyway.

When designing the voice dialogue, you need to balance the needs of the raw beginner with those of the power user who keeps on ringing your service. The best way to do this is to allow the user to type ahead. The hardware can detect DTMF digits pressed even while it is playing out a prompt. It is a good policy to arrange for the voice playback sequence to abort immediately if the user has keyed a response (this feature is called cut-through). If you have reason to believe you have lost synchronisation with the user, you should disable cut-through, play an error message, flush the digit queue and then re-enable cut-through. Note that this is the voice equivalent of putting up an alert dialog and then flushing the keyboard buffer on a traditional GUI.
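The interplay between type-ahead, cut-through and the recovery sequence is sketched below as a small simulation. All of the names are hypothetical, and the model is deliberately crude: play_prompt only checks the digit queue before it starts, whereas real hardware would also abort a prompt part-way through when a digit arrives.

```c
#include <assert.h>
#include <string.h>

#define DIGIT_QUEUE_SIZE 16

/* Simulated type-ahead buffer of DTMF digits not yet consumed. */
static char digit_queue[DIGIT_QUEUE_SIZE + 1];
static int cut_through_enabled = 1;
static int prompts_played_in_full = 0;   /* bookkeeping for the simulation */

/* Simulated keypress arriving from the caller. Returns 0 if the queue is full. */
int key_pressed(char digit)
{
    size_t len = strlen(digit_queue);
    if (len >= DIGIT_QUEUE_SIZE)
        return 0;
    digit_queue[len] = digit;
    digit_queue[len + 1] = '\0';
    return 1;
}

/* Throw away anything the caller has typed ahead. */
void flush_digits(void)
{
    digit_queue[0] = '\0';
}

/* Play a prompt. Returns 1 if it played to the end, or 0 if cut-through
   aborted it because a digit was already waiting. */
int play_prompt(const char *name)
{
    assert(name != NULL);
    if (cut_through_enabled && digit_queue[0] != '\0')
        return 0;
    prompts_played_in_full++;
    return 1;
}

/* Recovery sequence from the text: disable cut-through so the error
   message cannot be skipped, play it, flush the queue, re-enable. */
void resynchronise(void)
{
    cut_through_enabled = 0;
    play_prompt("ERROR");
    flush_digits();
    cut_through_enabled = 1;
}
```

The order inside resynchronise matters: flushing before the error message would let digits keyed during the message survive, and the caller would be out of step again.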

If the user should hang up on you in the middle of a call, your application must be able to cope and terminate the call. It is no good waiting in an infinite loop prompting the caller to press a key. The voice driver will indicate a hangup to your software if it hears a burst of tone generated by the exchange when the other party hangs up. You shouldn't rely on this and all loops in your code should exit after a maximum number of retries.
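A bounded prompt loop might be written along these lines. The result codes, MAX_RETRIES and the read_digit callback are assumptions for the sake of the example; the three simulated callers at the bottom stand in for whatever the voice library actually reports.

```c
#include <assert.h>
#include <string.h>

#define MAX_RETRIES 3

/* Hypothetical result codes; valid digits are returned as characters. */
enum { GOT_TIMEOUT = -1, GOT_HANGUP = -2, GAVE_UP = -3 };

/* Ask for a menu choice, but never loop forever.
   `read_digit` stands in for "play the prompt, then wait for a key":
   it returns a digit character, GOT_TIMEOUT or GOT_HANGUP. */
int get_menu_choice(int (*read_digit)(void), const char *valid_digits)
{
    int attempt;

    assert(read_digit != NULL && valid_digits != NULL);
    for (attempt = 0; attempt < MAX_RETRIES; attempt++) {
        int d = read_digit();
        if (d == GOT_HANGUP)
            return GOT_HANGUP;          /* driver noticed the hangup */
        if (d > 0 && strchr(valid_digits, d) != NULL)
            return d;                   /* a valid choice */
        /* Timeout or invalid key: go round again and re-prompt. */
    }
    return GAVE_UP;                     /* caller has probably gone */
}

/* Simulated callers for trying the loop out. */
static int silent_calls = 0;
int silent_caller(void)     { silent_calls++; return GOT_TIMEOUT; }
int caller_pressing_2(void) { return '2'; }
int caller_hanging_up(void) { return GOT_HANGUP; }
```

The handler treats GAVE_UP exactly like GOT_HANGUP and releases the line, so a caller who hung up without the exchange tone being detected ties up the channel for only a few prompts.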

Conclusion

There is a lot more to building a voice application than meets the eye. I haven't even started to talk about the additional complexities of making outbound calls. There is a lot of detail to get right and you would be well advised to invest in a good C library before you start.

Here are some of the things you should definitely look for:

You might also need the following:

