<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[cogint.ai]]></title><description><![CDATA[A blog exploring AI in voice, vision, and human communications]]></description><link>https://cogint.ai/</link><image><url>https://cogint.ai/favicon.png</url><title>cogint.ai</title><link>https://cogint.ai/</link></image><generator>Ghost 1.26</generator><lastBuildDate>Wed, 25 Mar 2026 21:25:53 GMT</lastBuildDate><atom:link href="https://cogint.ai/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Improving Dialogflow Phone Calls by Adding Noise]]></title><description><![CDATA[How to add filler noise (aka comfort noise) and ambient noises to Dialogflow voicebots for the phone to improve the user experience.]]></description><link>https://cogint.ai/dialogflow-adding-noise/</link><guid isPermaLink="false">5e47e95f19179907b8df4e3a</guid><category><![CDATA[dialogflow]]></category><category><![CDATA[voicebot]]></category><category><![CDATA[guide]]></category><dc:creator><![CDATA[Chad Hart]]></dc:creator><pubDate>Tue, 10 Mar 2020 11:15:00 GMT</pubDate><media:content url="https://cogint.ai/content/images/2020/03/Depositphotos_12301680_l-2015-cropped.jpg" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://cogint.ai/content/images/2020/03/Depositphotos_12301680_l-2015-cropped.jpg" alt="Improving Dialogflow Phone Calls by Adding Noise"><p>Generally, technology has advanced to remove noise and make audio clearer, with a higher signal-to-noise ratio. While this makes sense in most contexts, various forms of noise have become part of the user experience when making a phone call. Dialogflow is largely made for smartphone and smart-speaker environments, so subtleties of the phone medium can be lost.
Have you ever noticed how a phone call is never truly silent? There’s a reason for that. Fortunately, it isn’t too difficult to add noise to a Dialogflow voicebot for the phone.</p>
<p>In this post I will cover how to add two kinds of noise:</p>
<ul>
<li>Filler noise - using noises to mask silent periods, and</li>
<li>Ambient noise - background noise to simulate a specific environment</li>
</ul>
<p>Read on for details and examples.</p>
<h1 id="fillernoise">Filler Noise</h1>
<p>Let’s look at filler noise first, which is generally easier.</p>
<h2 id="whyaddfillernoise">Why add filler noise?</h2>
<p>We usually try to avoid noise, so why would we want to add it to our bot?</p>
<h3 id="absolutesilenceisbad">Absolute silence is bad</h3>
<p>Users don’t like to hear silence on the phone. One byproduct of the original analog phone system was that users always heard some electrostatic background noise. This noise eventually became a feature: users learned to interpret it as a sign they were still connected, even if the other party wasn’t talking. When Voice over IP (VoIP) systems came about, <a href="https://en.wikipedia.org/wiki/Comfort_noise">comfort noise</a> was engineered into the call. Most VoIP systems inject artificially generated comfort noise into a call that would otherwise be perfectly silent when someone isn’t speaking.</p>
<h3 id="noiseisanonverbalformofcommunication">Noise is a non-verbal form of communication</h3>
<p>Smart speakers generally have LED indicators that tell the user the device heard them and is processing. This visual indicator functions as a subtle feedback mechanism, letting the user know the device is doing something. However, if you hook that same virtual assistant up to a phone call, you lose that visual indicator and are limited to audio signals. Furthermore, phone assistants are trying to mimic human agents. Humans often need time to think before responding to prompts. They rarely respond instantly, and when they do it is often with a speech disfluency - e.g. “uh, ok - let me check that”.</p>
<p>Oftentimes users also hear some background call center noise or the agent typing between interactions, which also functions as a type of comfort noise letting the user know the line is still connected.</p>
<h2 id="usingssmltofakeadelaywithnoise">Using SSML to fake a delay with noise</h2>
<p>Inserting sounds inside of a response is easy using the <code>&lt;audio&gt;</code> tag in SSML. Google <a href="https://developers.google.com/assistant/actions/reference/ssml#audio">supports several properties</a> for when to start and stop the clip. I find the <code>repeatDur</code> property the easiest to use since playback won’t exceed the value you enter and the audio clip will repeat if it happens to be shorter.</p>
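<p>For example, a small helper like this (the function name is just for illustration; the sound URL is one of the Google sound library clips used later in this post) could wrap two phrases around a capped typing-noise clip:</p>

```javascript
// Build an SSML response that plays a filler sound between two phrases.
// repeatDur caps the total playback; shorter clips are looped to fill it.
function withFillerNoise(before, after, seconds) {
  const clip = 'https://actions.google.com/sounds/v1/office/keyboard_typing_fast_far.ogg';
  return `<speak>
  ${before}
  <audio repeatDur="${seconds}s" src="${clip}">
    <desc>keyboard typing</desc>
  </audio>
  ${after}
</speak>`;
}

console.log(withFillerNoise('Let me check that.', 'Found it!', 3));
```

<p>The resulting string is what you would hand to your response builder in a fulfillment handler.</p>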
<h3 id="findingfillersounds">Finding filler sounds</h3>
<p>You will also need to find some sound files or record your own. You could record some noises at an agent desk or you could look for some recordings online. Google actually has a filterable library of sounds you can find, listen to, and link to here: <a href="https://developers.google.com/assistant/tools/sound-library">https://developers.google.com/assistant/tools/sound-library</a><br>
Amazon has a <a href="https://developer.amazon.com/en-US/docs/alexa/custom-skills/ask-soundlibrary.html">massive sound library for Alexa Skills</a>, but their terms limit use to Alexa apps.<br>
There is also a <a href="https://www.youtube.com/audiolibrary/soundeffects">free sound library on YouTube</a>. Make sure to read their usage terms.</p>
<p>Let's start with a typing noise, which isn't uncommon to hear when calling an agent:</p>
<p><audio controls><source src="https://actions.google.com/sounds/v1/office/keyboard_typing_fast_far.ogg" type="audio/ogg"><br>
Your browser does not support the audio tag.<br>
</audio></p>
<p>Note it isn't very loud (and shouldn't be), so you may need to turn your speakers up.</p>
<h3 id="examplebot">Example bot</h3>
<p>As a simple example, let’s make a bot that does simple multiplication. I called it <code>math.multiply</code> with some training phrases:<br>
<img src="https://cogint.ai/content/images/2020/03/1---bot-training.png" alt="Improving Dialogflow Phone Calls by Adding Noise"></p>
<p>With 2 parameters:<br>
<img src="https://cogint.ai/content/images/2020/03/2---parameters.png" alt="Improving Dialogflow Phone Calls by Adding Noise"></p>
<h3 id="fulfillmentexample">Fulfillment example</h3>
<p>Now let’s show how we can make our agent add a pause with some background noise. My complete fulfillment code looks like this:</p>
<pre><code>const functions = require('firebase-functions');
const {WebhookClient} = require('dialogflow-fulfillment');
 
process.env.DEBUG = 'dialogflow:debug'; // enables lib debugging statements
 
exports.dialogflowFirebaseFulfillment = functions.https.onRequest((request, response) =&gt; {
  const agent = new WebhookClient({ request, response });
  console.log('Dialogflow Request headers: ' + JSON.stringify(request.headers));
  console.log('Dialogflow Request body: ' + JSON.stringify(request.body));

  function multiply(agent) {
    const number1 = agent.parameters.number1;
    const number2 = agent.parameters.number2;
    const answer = number1 * number2;
    
    const max = 5;
    const min = 2;
    // random pause between min and max seconds, rounded to one decimal place
    const duration = (Math.random() * (max - min) + min).toFixed(1);

    agent.add(`
       &lt;speak&gt; ${number1} multiplied by ${number2}.
         &lt;audio repeatDur=&quot;${duration}s&quot; src=&quot;https://actions.google.com/sounds/v1/office/keyboard_typing_fast_close.ogg&quot;&gt;
           &lt;desc&gt;keyboard typing&lt;/desc&gt;
         &lt;/audio&gt;
         That comes to ${answer}.
       &lt;/speak&gt;`);
  }
  
  // Run the proper function handler based on the matched Dialogflow intent name
  let intentMap = new Map();
  intentMap.set('math.multiply', multiply);
  agent.handleRequest(intentMap);
});
</code></pre>
<p>This is mostly Dialogflow’s standard fulfillment example code except for the <code>multiply</code> function. You can see I just pick a random duration between the min and max so the bot plays 2 to 5 seconds of typing noise before answering.</p>
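<p>If you want more variety, you could also rotate through a small pool of filler clips rather than always playing the same one. A sketch, reusing the two typing clips that appear in this post (the function name is made up):</p>

```javascript
// Two typing clips from the Google Assistant sound library (both used in this post)
const FILLER_CLIPS = [
  'https://actions.google.com/sounds/v1/office/keyboard_typing_fast_far.ogg',
  'https://actions.google.com/sounds/v1/office/keyboard_typing_fast_close.ogg'
];

// Return an SSML <audio> element with a random clip and a random duration
// between minSec and maxSec seconds, rounded to one decimal place.
function randomFiller(minSec = 2, maxSec = 5) {
  const src = FILLER_CLIPS[Math.floor(Math.random() * FILLER_CLIPS.length)];
  const duration = (Math.random() * (maxSec - minSec) + minSec).toFixed(1);
  return `<audio repeatDur="${duration}s" src="${src}"><desc>filler noise</desc></audio>`;
}

console.log(randomFiller());
```

<p>Dropping the returned element into the template literal above would replace the hard-coded <code>&lt;audio&gt;</code> tag.</p>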
<h2 id="hearsomefillernoise">Hear some filler noise</h2>
<p>You can try this below without dialing in.<br>
Just ask it something like:</p>
<blockquote>
<p>what is 431 times 31234?</p>
</blockquote>
<iframe width="800" height="730" allow="microphone;" src="https://console.dialogflow.com/api-client/demo/embedded/8dde093a-1480-44f8-bcb4-f514eb650a38"></iframe>
<p>Make sure to hit the audio button to hear the response.</p>
<h1 id="ambientbackgroundnoise">Ambient background noise</h1>
<p>Even if you aren’t running an <a href="https://www.vox.com/2015/7/15/8965393/asmr-video-youtube-autonomous-sensory-meridian-response">ASMR</a> phone service, ambient background noise can give your phonebot a distinctive, life-like experience.</p>
<h2 id="whyambientnoise">Why ambient noise?</h2>
<p>If you called a busy coffee shop and a live person picked up (assuming they don’t have a fancy noise-cancelling microphone), you would hear some background noise. As long as the background noise isn’t too loud or distracting, this ambience can make the experience seem more authentic.</p>
<h2 id="findingbackgroundnoises">Finding Background Noises</h2>
<p>Google actually includes a number of these ambiences in the Google Assistant sound library: <a href="https://developers.google.com/assistant/tools/sound-library/ambiences">https://developers.google.com/assistant/tools/sound-library/ambiences</a></p>
<p>Of course, you can always record one yourself in a real environment.</p>
<h2 id="addingcontinuousbackgroundnoise">Adding Continuous Background Noise</h2>
<p>As we showed earlier, adding noise within a prompt is simple, but how do you add in ambient background noises?</p>
<p>Dialogflow does not really provide a way to do this out of the box. It plays a prompt and waits for a response; it won’t generate noise while it is waiting. The only option is to do this from your telephony platform or RTC-bot gateway. Usually this will look something like this:<br>
<img src="https://cogint.ai/content/images/2020/03/3---mixing-background-noise.png" alt="Improving Dialogflow Phone Calls by Adding Noise"></p>
<h6 id="usingconferencingtomixinambientaudio">Using conferencing to mix in ambient audio</h6>
<p>You need some kind of mixer that will combine the Dialogflow audio stream with the ambient noise. Often this can be implemented via a conference bridge. Dialogflow’s speech-to-text is pretty good at ignoring background noise, but it is always best to send it a clean signal if possible, so send the media directly from the caller to Dialogflow without the ambient noise mixed in if you can.</p>
<p>Dialogflow’s Phone Gateway will not do this by itself, but some of the third party options let you do mixing/conferencing in their platform. If you are using the <a href="https://cogint.ai/3-methods-for-connecting-a-phone-call-to-dialogflow/#method2forwardcallstodialogflowphonegateway">call forwarding approach</a>, you can always mix the audio in a conference bridge before forwarding the call to the Dialogflow Phone Gateway.</p>
<p>I added the coffee shop noise to the Math-bot above as an example.<br>
Here is the background noise:<br>
<audio controls> <source src="https://actions.google.com/sounds/v1/ambiences/coffee_shop.ogg" type="audio/ogg"><br>
Your browser does not support the audio tag.<br>
</audio></p>
<p>And here is a recording of a call where I ask it some questions:<br>
<audio controls><source src="https://cogint.ai/ambient_noise_example.mp3" type="audio/mp3"><br>
Your browser does not support the audio tag.<br>
</audio></p>
<p>I happened to use Voximplant to build this example. You can see the code for it <a href="https://gist.github.com/chadwallacehart/d88e3fd94b50d3afd86f18f990dfa63d">here</a>.</p>
<h2 id="makesomenoise">Make Some Noise</h2>
<p>Noise isn’t always bad. Used effectively, it will actually make your calls seem less like a bot and more like a human on the other end. Used creatively, it can help enhance a business’s brand and make the phone experience unique.</p>
<p>Let me know how you make use of various noises in your phonebots in the comments below.</p>
<hr>
<h3 id="abouttheauthor">About the Author</h3>
<p>Chad Hart is an analyst and consultant with <a href="https://cwh.consulting">cwh.consulting</a>, a product management, marketing, and strategy advisory helping to advance the communications industry. In addition, recently he co-authored a study on <a href="https://krankygeek.com/research">AI in RTC</a> and organizes <a href="https://krankygeek.com">events</a> / <a href="https://www.youtube.com/watch?v=P38cd3GLn74&amp;list=PL4_h-ulX5eNfaM0QM5r-PewWaY_zgLH7b">YouTube series</a> covering that topic.</p>
<hr>
<h3 id="remembertosubscribefornewpostnotificationsandfollowcogintai">Remember to <a href="https://cogint.ai/subscribe/">subscribe</a> for new post notifications and follow <a href="https://twitter.com/cogintai">@cogintai</a>.</h3>
</div>]]></content:encoded></item><item><title><![CDATA[Build a Conversational IVR with Amazon’s Lex and Connect in 45 minutes]]></title><description><![CDATA[Review and guide on building an interactive, speech-based conversational / voicebot IVR using  Amazon Lex and Amazon Connect]]></description><link>https://cogint.ai/conversational-ivr-with-amazons-lex-and-connect/</link><guid isPermaLink="false">5e0d29ea19179907b8df4e1b</guid><category><![CDATA[voicebot]]></category><category><![CDATA[guide]]></category><dc:creator><![CDATA[Binoy Chemmagate]]></dc:creator><pubDate>Fri, 03 Jan 2020 11:59:00 GMT</pubDate><media:content url="https://cogint.ai/content/images/2020/01/Depositphotos_82899872_l-2015.jpg" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://cogint.ai/content/images/2020/01/Depositphotos_82899872_l-2015.jpg" alt="Build a Conversational IVR with Amazon’s Lex and Connect in 45 minutes"><p>Voicebot usage with smart phones and smart speakers is growing like a storm. Bots in general are making their way to more customer-support use cases, but it is still pretty rare to hear a voicebot on a customer support phone call. Amazon is one of the major tech giants looking to change this with a combination of some of their recent capabilities.</p>
<p>In this blog, I will show you how to build a voicebot for your business using a combination of Amazon's bot and contact center services - Amazon Lex and Amazon Connect.</p>
<h1 id="introductiontoamazonlexandconnect">Introduction to Amazon Lex and Connect</h1>
<p>Lex is Amazon’s Natural Language Understanding (NLU) service for bots. It includes an Automatic Speech Recognition (ASR) - aka Speech-to-Text (STT) - option, so it can handle voice in addition to text. Since it is an AWS service, it incorporates Amazon’s usual scalability and pay-as-you-go model that developers and businesses of all sizes have come to depend on.</p>
<p>Amazon Connect is a Contact Center as a Service (CCaaS) that can be set up in a few minutes. Rather than building a call center from scratch or purchasing expensive software, Amazon Connect provides tools to make provisioning a call center easy. Connect includes a phone number for incoming and outgoing calls with tools for provisioning agents quickly from a dashboard.</p>
<p>Taken in combination, Connect and Lex can be used to build a conversational IVR. The advantage of the AWS combo over existing solutions is that you can connect your Lex bot to a dial-in number without a 3rd party service. This eliminates complexities such as setting up a telephony interface between the dial-in number and the bot, reduces latency (given the Lex bot and Amazon Connect are in the same AWS region), and makes changing and publishing the bot faster.</p>
<h1 id="architecture">Architecture</h1>
<p>When you are building a conversational IVR voicebot, you need 3 main components.</p>
<ol>
<li>Bot engine</li>
<li>Telephony infrastructure - to handle calls</li>
<li>RTC-bot gateway - to connect the bot to the telephony system</li>
</ol>
<p>Amazon Lex acts as the voicebot in our case. Amazon Connect provides the dial-in number for the voicebot. Fortunately, we don’t have to look further for an RTC-bot gateway since Amazon Connect can connect directly to Amazon Lex based voicebots. You can check this <a href="https://docs.aws.amazon.com/connect/latest/adminguide/amazon-lex.html">documentation</a> for more details.</p>
<p><img src="https://cogint.ai/content/images/2020/01/architecture.png" alt="Build a Conversational IVR with Amazon’s Lex and Connect in 45 minutes"></p>
<h1 id="pricing">Pricing</h1>
<p>The pricing is split between Lex and Connect since you need both services:</p>
<table>
<thead>
<tr>
<th><strong>Amazon Lex</strong></th>
<th><strong>Price in USD</strong>*</th>
</tr>
</thead>
<tbody>
<tr>
<td>  Speech request</td>
<td>$0.004 / request</td>
</tr>
<tr>
<td>  Text request</td>
<td>$0.00075 / request</td>
</tr>
<tr>
<td><strong>Amazon Connect</strong></td>
<td> </td>
</tr>
<tr>
<td>  Phone number (U.S. East)</td>
<td>$0.03 per day</td>
</tr>
<tr>
<td>  Inbound usage (U.S. East)      </td>
<td>$0.0022 per minute</td>
</tr>
</tbody>
</table>
<h6 id="amazonconnectandlexpricingsnapshotasof2019dec30th">*Amazon Connect and Lex pricing snapshot as of 2019 Dec 30th.</h6>
<p>Let’s say you have a smaller call center with 10,000 calls a month, with an average call duration of 3 minutes. Assuming a split of 50%, 40%, and 10% for the bot, customer, and silence respectively in each call, we can calculate the total cost for Amazon Lex and Connect. For each call, on average we are sending 5 speech requests to Lex.</p>
<table>
<thead>
<tr>
<th>Billable item</th>
<th>Unit Price  </th>
<th>Unit</th>
<th style="text-align:right">Units</th>
<th style="text-align:right">Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>Lex speech requests</td>
<td>$0.0040</td>
<td>/request</td>
<td style="text-align:right">50,000</td>
<td style="text-align:right">200.00</td>
</tr>
<tr>
<td>Amazon Connect - Phone number  </td>
<td>$0.0300</td>
<td>/day</td>
<td style="text-align:right">20</td>
<td style="text-align:right">0.60</td>
</tr>
<tr>
<td>Connect - Inbound usage</td>
<td>$0.0022</td>
<td>/minute</td>
<td style="text-align:right">30,000</td>
<td style="text-align:right">66.00</td>
</tr>
<tr>
<td>-------------------------------------------</td>
<td>--------------</td>
<td>------------</td>
<td style="text-align:right">----------</td>
<td style="text-align:right">------------</td>
</tr>
<tr>
<td><em><strong>Total</strong></em></td>
<td><em><strong>$0.0089</strong></em></td>
<td><em><strong>/minute</strong></em></td>
<td style="text-align:right">     <em><strong>30,000</strong></em></td>
<td style="text-align:right">  <em><strong>$266.60</strong></em></td>
</tr>
</tbody>
</table>
<p>Your total cost is around $267, given you are not using any monitoring or storage functionality. That comes to less than a cent per minute, much less than competitive voicebot telephony options.</p>
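<p>For reference, the arithmetic behind the table works out like this:</p>

```javascript
// Monthly cost estimate: 10,000 calls, 5 Lex speech requests per call,
// 3 minutes per call, and 20 billable days for the phone number.
const calls = 10000;
const lexRequests = calls * 5;        // 50,000 speech requests
const minutes = calls * 3;            // 30,000 inbound minutes

const lexCost = lexRequests * 0.004;  // Lex speech requests
const numberCost = 20 * 0.03;         // Connect phone number
const inboundCost = minutes * 0.0022; // Connect inbound usage

const total = lexCost + numberCost + inboundCost;
console.log(total.toFixed(2));             // 266.60
console.log((total / minutes).toFixed(4)); // per-minute cost, about $0.0089
```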
<h1 id="howtoguide">How to Guide</h1>
<h2 id="creatingavoicebotusinglex">Creating a Voicebot using Lex</h2>
<p>Assuming you already have an AWS (Amazon Web Services) account, all you have to do is click on the Amazon Lex service. If this is your first bot, choose <em>Get Started</em>; otherwise, click <em>Create</em>. Lex offers a few templates for building bots; I will choose <em>Custom Bot</em> so you can build any bot you wish.</p>
<p><img src="https://cogint.ai/content/images/2020/01/lex---create-your-bot.png" alt="Build a Conversational IVR with Amazon’s Lex and Connect in 45 minutes"></p>
<p>I have named the bot <em>RestaurantBot</em> since this is a bot intended to make reservations for a restaurant. The default language is US English, and you can choose male or female voices and type sample texts to hear the voice samples. The session timeout defines how long you would like the bot to keep the context of the conversation when the customer goes silent. Here I have chosen 1 minute, which is about right for a restaurant reservation. If you would like to measure the sentiment of the customer conversations, then click <em>Yes</em>. The <a href="https://docs.aws.amazon.com/lex/latest/dg/security_iam_service-with-iam.html">IAM role</a> is automatically created and you can opt for <a href="https://docs.aws.amazon.com/lex/latest/dg/data-protection.html">COPPA</a> based on your preference.</p>
<p><img src="https://cogint.ai/content/images/2020/01/Bot-creation-general-settings.png" alt="Build a Conversational IVR with Amazon’s Lex and Connect in 45 minutes"><br>
Click on <em>Create</em> and the first screen is about creating an intent.</p>
<h3 id="botterminology">Bot terminology</h3>
<p>Before we go too far, let’s review some terminology. When you are building/training with bots using Dialogflow, Amazon Lex or any other virtual assistants, you need to be familiar with some basic Natural Language Processing (NLP) concepts:</p>
<ul>
<li><strong>Intents</strong> - The intention or goal of the customer. A customer booking a table can be an intent e.g. <em>BookMyTable</em></li>
<li><strong>Utterances</strong> - These are the spoken phrases by the customer to invoke an intent. This could be any phrase for reservation e.g. Can I have a table for two? I am looking for a dinner reservation, etc.</li>
<li><strong>Slots</strong> - The essential information needed for the voicebot to fulfill a customer request. This could be the date or time or the number of people, etc.</li>
<li><strong>Fulfillment</strong> - This is the action you want the voicebot to perform when the essential information is available. This can be two things, a response to the customer saying the reservation was successful or failure and a message to the restaurant for booking the slot.</li>
</ul>
<p>There is plenty of <a href="https://docs.aws.amazon.com/lex/latest/dg/how-it-works.html">documentation</a> out there that explains these in more detail.</p>
<h3 id="creatinganintent">Creating an Intent</h3>
<p>Click on <em>Create intent</em> and give a unique and identifiable intention/goal name. Here I have chosen <em>BookMyTable</em>.<br>
<img src="https://cogint.ai/content/images/2020/01/Lex---create-intent.png" alt="Build a Conversational IVR with Amazon’s Lex and Connect in 45 minutes"></p>
<p>Click <em>Add</em> and that's it. Good job - you have created your first intent!</p>
<p>Now you will add a sample utterance for the bot so it can relate the phrases to the intent you just created.</p>
<h3 id="addingsampleutterances">Adding sample utterances</h3>
<p>Type in the common spoken phrases you would use while making table reservations. I am using the following samples.</p>
<p><img src="https://cogint.ai/content/images/2020/01/Lex---sample-utterances.png" alt="Build a Conversational IVR with Amazon’s Lex and Connect in 45 minutes"><br>
One advantage of using Lex is that you can also invoke a <a href="https://docs.aws.amazon.com/lambda/index.html">Lambda</a> function for processing the customer inputs. I will briefly cover some use cases for Lambda functions in the last section. Now that you have trained the intent with some utterances, the next step is adding slots.</p>
<h3 id="addingslots">Adding slots</h3>
<p>Slots make sure the bot has all the information needed from the customer to perform an action or lead to the fulfillment of a request. For making a restaurant reservation, you probably need 4 essential slots.</p>
<ol>
<li>Name of the customer</li>
<li>Date of the reservation</li>
<li>Time of the reservation</li>
<li>Number of people</li>
</ol>
<p>The customer might already give some of this information and you can add the slots within the <em>Sample utterances</em> to catch them early so that the bot need not ask those questions again.</p>
<p><img src="https://cogint.ai/content/images/2020/01/Lex---slots.png" alt="Build a Conversational IVR with Amazon’s Lex and Connect in 45 minutes"></p>
<p>We can make each slot mandatory or optional depending on the bot you are building. The slot type defines the input type expected from the customer. The prompts are the bot’s utterances used to elicit customer input. I have made all the slots mandatory since a restaurant would require all of that information to confirm a reservation; details like table preferences or food allergies could be added as optional slots.</p>
<p>We can add a confirmation prompt after adding the slots. This will repeat the reservation details and confirm with the customer. I am reading out the customer inputs using the slot names we defined earlier.</p>
<p><img src="https://cogint.ai/content/images/2020/01/Lex---confirmation-prompt.png" alt="Build a Conversational IVR with Amazon’s Lex and Connect in 45 minutes"></p>
<h3 id="addingfulfillment">Adding fulfillment</h3>
<p>Fulfillment lets you send the reservation details (if everything goes well) to your restaurant’s booking system by invoking Lambda functions, or simply return the parameters to the customer. A Lambda function comes in handy when you want to check the reservation against table availability or opening hours; the bot can then continue the conversation with alternative options.</p>
<p>The conversation ends with adding a response and <strong>saving</strong> the intent. You can add your favorite greeting here.</p>
<p><img src="https://cogint.ai/content/images/2020/01/Lex---confirmation-prompt-1.png" alt="Build a Conversational IVR with Amazon’s Lex and Connect in 45 minutes"></p>
<h3 id="buildtestandpublish">Build, Test and Publish</h3>
<p>Click on the “Build” button in the top right corner. Any errors it reports are self-explanatory, so they are easy to debug. Once you confirm the build and it succeeds, you will see the message shown below.<br>
<img src="https://cogint.ai/content/images/2020/01/Lex---build-successful.png" alt="Build a Conversational IVR with Amazon’s Lex and Connect in 45 minutes"></p>
<p>To test the bot, start by entering one of the spoken phrases you have added as utterances and continue the conversation.<br>
<img src="https://cogint.ai/content/images/2020/01/Lex---testing.png" alt="Build a Conversational IVR with Amazon’s Lex and Connect in 45 minutes"></p>
<p>Once you get the correct responses, you can publish the bot by clicking <em>Publish</em>; the bot will then be available for other AWS services to access. Do not forget to create an alias name so Amazon Connect can identify this bot among the other bots you have created. You have completed building a basic bot - great job!</p>
<p>Now let’s add a dial-in interface for it.</p>
<h2 id="addingadialinnumberforyourbotwithconnect">Adding a dial-in number for your bot with Connect</h2>
<p>We will use Amazon Connect to add a phone number that connects callers to the bot. Start by selecting the Amazon Connect service from AWS services and choose <em>Get Started</em>, or <em>Add Instance</em> if you already have an Amazon Connect instance. Give the instance a unique name and click <em>Next Step</em>, as the default configurations are good enough for our use case. Finish by clicking <em>Create instance</em>; it might take a while for the instance to be ready.</p>
<h3 id="addingyourlexbottocontactflows">Adding your Lex bot to contact flows</h3>
<p>Go back to your Amazon Connect service panel in AWS services (NOT the Amazon Connect instance URL) and click on the instance name and select the <em>Contact flows</em>. The contact flow defines the customer experience from start to end. We want the customer to talk to our bot instead of a real agent so we need to add the bot to the contact flow.</p>
<p>Select the Lex bot from the drop-down menu and click <em>Add Lex Bot</em>. Make sure you create the Lex bot and Amazon Connect in the same AWS regions to avoid delays.</p>
<p><img src="https://cogint.ai/content/images/2020/01/Connect---add-Lex.png" alt="Build a Conversational IVR with Amazon’s Lex and Connect in 45 minutes"></p>
<h3 id="setupyouramazonconnectinstance">Setup your Amazon Connect Instance</h3>
<p>Log in to your Amazon Connect instance (e.g. <a href="https://chemmagate.awsapps.com/connect/login">https://chemmagate.awsapps.com/connect/login</a>) as Admin, click on Routing on the side panel, and select <em>Contact Flows</em>. We are going to create a new contact flow for the bot since we want to capture the customer’s input. Do not forget to give your contact flow a name. Drag the <em>Get customer input</em> block under <em>Interact</em> onto the designer and connect it to the <em>Entry point</em> block.</p>
<p><img src="https://cogint.ai/content/images/2020/01/Connect---contact-flow-setup.png" alt="Build a Conversational IVR with Amazon’s Lex and Connect in 45 minutes"></p>
<p>Click on the <em>Get Customer Input</em> block and choose <em>Text-to-speech or chat text</em>. Add your greeting to the customer in this text box.<br>
<img src="https://cogint.ai/content/images/2020/01/Connect---get-customer-input.png" alt="Build a Conversational IVR with Amazon’s Lex and Connect in 45 minutes"></p>
<p>Choose Amazon Lex and select your bot from the list. Also, add the intent you created and save it. We can keep the alias as <code>$LATEST</code> as long as you do not have multiple versions of your bot.</p>
<p><img src="https://cogint.ai/content/images/2020/01/Connect---set-alias.png" alt="Build a Conversational IVR with Amazon’s Lex and Connect in 45 minutes"></p>
<p>You can add a <em>Disconnect/Hang up</em> block so that your bot disconnects the call after it fulfills the intent. Press <em>Publish</em> to make this flow available to use (this is important).<br>
<img src="https://cogint.ai/content/images/2020/01/Connect---disconnect-block.png" alt="Build a Conversational IVR with Amazon’s Lex and Connect in 45 minutes"></p>
<h3 id="assigninganumbertothecontactflow">Assigning a number to the contact flow</h3>
<p>We will now claim a number for the contact flow we just created. Click on Routing on the side panel and select <em>Phone Numbers</em>. Choose <em>Claim a number</em>, then select the country and the number you would like to use. Select the contact flow you just created and click <em>Save</em>.</p>
<p><img src="https://cogint.ai/content/images/2020/01/Connect---phone-number.png" alt="Build a Conversational IVR with Amazon’s Lex and Connect in 45 minutes"></p>
<p>Call the number and ask “Can I have a table for Friday?” (or the utterances you created) and see how your bot responds. It might say “sorry, I could not understand, can you please repeat that”. After a few tries, it will eventually work 😀</p>
<p>Congratulations, you made a basic voicebot-based conversational IVR!</p>
<h1 id="recapconclusions">Recap &amp; Conclusions</h1>
<p>The whole process took me about 45 minutes; since I had some previous experience with Amazon Connect, that part took the least amount of time in this setup. I had some hiccups while publishing the bot, as it is very strict about syntax and slot types. Of course, this is a simple example and a real bot would need a lot more work.</p>
<h2 id="prosandconsoftheawsapproach">Pros and Cons of the AWS approach</h2>
<p>There are some advantages and limitations to the AWS approach:</p>
<h3 id="advantages">Advantages</h3>
<ol>
<li>Dial-in number to voicebot works seamlessly so you don’t have to do any telephony interface tricks</li>
<li>Very easy to transfer the call to a human agent with Amazon Connect capabilities.</li>
<li>Lambda functions enhance your bot’s skills tremendously as it can invoke notifications such as SMS, email or integrate with 3rd party systems</li>
<li>Accept both DTMF and voice input from the customer, unlike other systems which can only process one input type (DTMF or voice commands) at a time</li>
<li>Easily integrate Amazon Lex with Facebook apps, Kik, Slack or Twilio SMS</li>
</ol>
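<p>To illustrate point 3, here is a minimal sketch of a Lambda function that could text the caller from a contact flow. The event shape and the injected <code>publish</code> function stand in for the real Amazon Connect invocation and SNS call, so treat the details as assumptions rather than verified AWS code.</p>

```javascript
// Hypothetical Lambda fulfillment for texting the caller from a Connect flow.
// The SNS client is injected as `publish` so the logic stays self-contained;
// in a real deployment it would wrap the AWS SDK's SNS publish call.

// Pull the caller's number out of the (assumed) Connect event shape and
// build the SMS parameters.
function buildSmsParams(event, message) {
  return {
    PhoneNumber: event.Details.ContactData.CustomerEndpoint.Address,
    Message: message,
  };
}

async function handler(event, publish) {
  const params = buildSmsParams(event, 'Your table is booked. See you Friday!');
  await publish(params);
  // Key/value pairs returned here can be read back as attributes in the flow
  return { smsStatus: 'sent' };
}
```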
<h3 id="limitations">Limitations</h3>
<ol>
<li>Training utterances was more cumbersome than I expected - I had to provide many sample utterances to get the bot to pick up a variety of phrases for an intent.</li>
<li>Confidence percentages are not exposed in the Lex GUI, so it is difficult to tell when you need to do more training for an intent</li>
<li>There is no built-in small talk (like Dialogflow), so you need to train it to do everything</li>
<li>The bot can sometimes get into a loop of asking the same question, which annoys the user; this again comes down to a lack of training</li>
<li>Amazon Connect configurations can be complex for a telephony system beginner.</li>
</ol>
<h2 id="scorecard">Scorecard</h2>
<p>Looking at the scorecard Chad used in <a href="https://cogint.ai/tag/voicebot/">his previous voicebot IVR posts</a>, Amazon Lex paired with Amazon Connect is a good fit for most voicebot IVR features.</p>
<table>
<thead>
<tr>
<th><strong>Requirements</strong></th>
<th><strong>Amazon Lex with Connect</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>Call Transfer</td>
<td>Yes</td>
</tr>
<tr>
<td>Recording</td>
<td>Yes</td>
</tr>
<tr>
<td>Playback interruption      </td>
<td>No - cannot be set from Lex console</td>
</tr>
<tr>
<td>No activity detection</td>
<td>Yes - session timeout</td>
</tr>
<tr>
<td>DTMF detection</td>
<td>Yes</td>
</tr>
<tr>
<td>SMS</td>
<td>Yes - with Lambda functions or channels</td>
</tr>
</tbody>
</table>
<p> <br>
Amazon Connect offers typical dial-in telephony capabilities like transfer and record. When a feature isn't there, Amazon does provide some flexibility to extend functionality with AWS Lambda functions, with nice features like built-in storage.</p>
<hr>
<h3 id="abouttheauthor">About the Author</h3>
<p><a href="https://www.linkedin.com/in/binoychemmagate/">Binoy Chemmagate</a> is a product manager with 9+ years of experience in the ICT industry. He started his ICT career with Nokia as a standardization engineer and standardization bodies such as IETF and W3C have recognized his work on Web transport protocols. He has co-authored publications and patents in real-time communication and machine to machine communication fields. He has been an invited speaker at international WebRTC conferences around the world. In his free time, he is involved in product development coaching in the local startup ecosystem.</p>
<hr>
<h3 id="remembertosubscribefornewpostnotificationsandfollowcogintai">Remember to <a href="https://cogint.ai/subscribe/">subscribe</a> for new post notifications and follow <a href="https://twitter.com/cogintai">@cogintai</a>.</h3>
</div>]]></content:encoded></item><item><title><![CDATA[AudioCodes Voice.AI Gateway Review]]></title><description><![CDATA[A review of AudioCodes' Voice.AI Gateway for connecting SIP and phone calls to speech and bot services. Review of the product and tests connecting to Dialogflow.]]></description><link>https://cogint.ai/audiocodes-voice-ai-gateway-review/</link><guid isPermaLink="false">5def0aa519179907b8df4e04</guid><category><![CDATA[voicebot]]></category><category><![CDATA[dialogflow]]></category><dc:creator><![CDATA[Chad Hart]]></dc:creator><pubDate>Tue, 10 Dec 2019 11:55:00 GMT</pubDate><media:content url="https://cogint.ai/content/images/2019/12/Depositphotos_86748052_l-2015.jpg" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://cogint.ai/content/images/2019/12/Depositphotos_86748052_l-2015.jpg" alt="AudioCodes Voice.AI Gateway Review"><p>My last several posts have looked at the use of voice-enabled bots – aka voicebots – for use in Interactive Voice Response (IVR) and other telephony applications. This post is a review of another voicebot telephony connectivity option - the Voice.AI Gateway from AudioCodes. If you want to connect a voicebot to a phone network, then you need an RTC-Bot Gateway. AudioCodes was one of the first VoIP gateway infrastructure vendors more than 20 years ago and the Voice.AI gateway continues this gateway tradition. Instead of connecting disparate network technologies, this time they are connecting to various AI platforms for speech processing and bot interaction.</p>
<h1 id="voiceaigatewayapproach">Voice.AI Gateway Approach</h1>
<p>The <a href="https://www.audiocodes.com/solutions-products/solutions/audiocodes-voiceai/voiceai-gateway">AudioCodes Voice.AI gateway</a> has some unique qualities compared to other RTC-bot gateway solutions on the market. You can see some of the differences in the marketing image from their <a href="https://www.audiocodes.com/solutions-products/solutions/audiocodes-voiceai/voiceai-gateway">product page</a> below:<br>
<img src="https://cogint.ai/content/images/2019/12/voiceai-gateway-diagram.png/resize?w=1200" alt="AudioCodes Voice.AI Gateway Review"></p>
<h6 id="diagramprovidedbyaudiocodescogintaiverifiedazuretexttospeechgoogletexttospeechdialogflowandawspollyinthisreview">Diagram provided by AudioCodes. cogint.ai verified Azure Text-to-Speech, Google Text-to-Speech, Dialogflow, and AWS Polly in this review.</h6>
<p>Notably, it is offered as a managed service that is designed to connect to a wide variety of existing telephony networks and interfaces with speech and bot APIs from various cloud providers.</p>
<h2 id="telephonyinfrastructure">Telephony Infrastructure</h2>
<p>AudioCodes Voice.AI Gateway is based on its Session Border Controller (SBC) product. If you haven’t dealt with VoIP infrastructure that’s probably a new term. Session Border Controllers act as traffic controllers and transformers for VoIP traffic. Prior to some of the technologies that have now become standard with WebRTC, they also helped to deal with some of the firewall and NAT traversal issues and still do for most <a href="https://en.wikipedia.org/wiki/Session_Initiation_Protocol">SIP networks</a>. They help to mediate signaling differences, even within a standard protocol like SIP, and help with media conversion where needed. Since SBCs are essentially a gateway device, it is somewhat natural to expand on the number of things that can be connected to the gateway. This is what AudioCodes has done with AI-based speech and bot services.</p>
<p>The SBC-core of the Voice.AI gateway means it has many telephony connectivity options for SIP networks and some WebRTC-based devices. It can be used to connect with older Enterprise and Contact Center SIP infrastructure in addition to modern SIP trunks for inbound and outbound calling. I am not sure if this works in conjunction with the AI capabilities, but SBCs typically have excellent high availability options, with the ability to keep a call connected even in the face of a catastrophic software failure.</p>
<h2 id="managedservice">Managed Service</h2>
<p>Unlike all the other RTC-Bot gateways I have reviewed, the Voice.AI Gateway is a managed service. This means AudioCodes handles all the setup, configuration, and infrastructure maintenance. If you want to make an adjustment, you need to ask AudioCodes to do it. This could be good or bad depending on your preferences. If you aren’t sure what you are doing or don’t have the staff to maintain a gateway, AudioCodes makes things easy. However, if you are the type of organization that likes to tinker and improve things yourself, then you’ll have to live with interacting with another party for your project, with minimal documentation and tools to see what is happening behind their black box.</p>
<p>AudioCodes prefers to run their managed service in Azure, though they support AWS too. They say they can also deploy this in a customer’s own cloud environment, which may be required in sensitive applications where routing customer data outside of the enterprise’s direct control is not allowed due to privacy concerns or regulatory requirements.</p>
<h2 id="multiplatform">Multi-platform</h2>
<p>The Voice.AI gateway connects to many different speech and bot services. AudioCodes has stated this includes:</p>
<table style="width:100%">
<tbody>
<tr>
<th>Speech to Text providers</th><th>Text to Speech providers</th><th>Bot engines</th>
</tr>
<tr>
<td><ul>
<li>Azure</li>
<li>Google</li>
<li>Yandex</li>
</ul></td>
<td><ul>
<li>Azure</li>
<li>Google</li>
<li>Amazon</li>
<li>Yandex</li>
</ul></td>
<td><ul>
<li>Azure Bot Service</li>
    <li>Dialogflow</li></ul></td>
    </tr>
    </tbody>
    </table>
<p>Why so many services? Certainly, an existing Azure Bot Service environment would expect support for Azure’s transcription and Text-to-Speech (TTS) synthesis services. The same is true for Dialogflow. The additional TTS providers can help give a bot a more distinctive voice. Beyond that, it is also possible to mix and match STT and TTS engines. This might make the most sense if someone had a Dialogflow bot but wanted to use something like Azure’s Custom Voice service, which lets developers design their own unique synthesized voice. I don’t expect we will see too many mixed-platform voicebots for reasons we will touch on in the review, but there are certainly scenarios where it could make sense.</p>
<h2 id="pricing">Pricing</h2>
<p>AudioCodes has 2 mutually exclusive pricing options for its Voice.AI Gateway – per-minute and per-concurrent-session. I describe the pricing they shared with me below.</p>
<h3 id="perminuteoption">Per-minute option</h3>
<p>The first model is priced on a per-minute basis:</p>
<table>
<thead>
<tr>
<th><strong>Minutes per Month</strong>         </th>
<th><strong>Price per Minute</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>10,000 to 500,000</td>
<td>$0.018</td>
</tr>
<tr>
<td>500,000 to 1,000,000</td>
<td>$0.015</td>
</tr>
</tbody>
</table>
<p> <br>
AudioCodes has a 10,000 minutes/mo minimum charge and a 12-month commitment, which comes to $180/mo and a total commitment of $2,160 for a year.</p>
<h3 id="persessionoption">Per-session option</h3>
<p>They also have a per-session option based on a maximum number of concurrent sessions per configuration:</p>
<table>
<thead>
<tr>
<th><strong>Sessions</strong>        </th>
<th><strong>Monthly price per session</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>5-10</td>
<td>$105</td>
</tr>
<tr>
<td>11-50</td>
<td>$89</td>
</tr>
<tr>
<td>51-100</td>
<td>$79</td>
</tr>
<tr>
<td>101-500</td>
<td>$71</td>
</tr>
</tbody>
</table>
<p> <br>
To use this option, you need to commit to at least 12 months of service at those levels with a minimum of 5 sessions a month - $525/mo for a total $6300 1-year commitment.</p>
<p>You will also need to cover the cost of transcription and speech synthesis, as well as the bot, on top of this. The math to figure out the cost for a given scenario is going to vary significantly based on the voicebot and user interaction behavior. A chatty bot will incur more TTS charges. A chatty customer will incur more STT. As a lower volume usage example, let’s assume our average voicebot interaction lasts 3 minutes, 40% of which is the customer, 50% is the bot, and 10% is silence, and we will handle 10,000 calls a month. Using Azure’s Speech to Text (STT) and AWS Polly for Text to Speech (TTS), this comes to about 3.12 cents a minute. The SIP trunk connectivity will usually run a quarter cent to a cent a minute, but that puts the Voice.AI Gateway price well below <a href="https://dialogflow.com/pricing">Dialogflow’s Phone Gateway 5 cents per minute pricing</a> for its Enterprise Essentials package.</p>
<table>
<thead>
<tr>
<th><strong>Service Charge</strong>           </th>
<th><strong>Unit Price</strong>     </th>
<th><strong>Unit</strong>          </th>
<th><strong>Units</strong>       </th>
<th style="text-align:right"><strong>Total</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>AudioCodes (Less than 500,000)  </td>
<td>$ 0.0180</td>
<td>/minute</td>
<td>30,000</td>
<td style="text-align:right">540.00</td>
</tr>
<tr>
<td>Azure Speech-to-Text, standard</td>
<td>$ 0.0167</td>
<td>/minute</td>
<td>12,000</td>
<td style="text-align:right">200.00</td>
</tr>
<tr>
<td>Dialogflow Text Charges</td>
<td>$ 0.0020</td>
<td>/minute</td>
<td>12,000</td>
<td style="text-align:right">24.00</td>
</tr>
<tr>
<td>AWS Polly TTS, neural voice</td>
<td>$ 0.0115</td>
<td>/minute</td>
<td>15,000</td>
<td style="text-align:right">172.94</td>
</tr>
<tr>
<td>---------------------------------------------</td>
<td>-------------</td>
<td>-----------</td>
<td>----------</td>
<td style="text-align:right">----------</td>
</tr>
<tr>
<td>Total</td>
<td>$ 0.0312</td>
<td>/minute</td>
<td>30,000</td>
<td style="text-align:right">$  936.94</td>
</tr>
</tbody>
</table>
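<p>For reference, the line items above can be reproduced with a few lines of arithmetic. The Polly figure is treated as an approximate per-minute rate here (Polly actually bills per character); summing the four line items gives roughly $937 a month, or about 3.1 cents per minute before SIP trunking.</p>

```javascript
// Reproduce the cost estimate: 10,000 calls/month averaging 3 minutes each,
// 40% caller speech (STT), 50% bot speech (TTS), 10% silence.
const totalMinutes = 10000 * 3;         // 30,000 gateway minutes
const sttMinutes = totalMinutes * 0.4;  // 12,000 minutes of customer speech
const ttsMinutes = totalMinutes * 0.5;  // 15,000 minutes of bot speech

const cost = {
  gateway: totalMinutes * 0.018,  // AudioCodes, under-500k-minute tier
  stt: sttMinutes * 0.0167,       // Azure Speech-to-Text, standard
  bot: sttMinutes * 0.002,        // Dialogflow text requests
  tts: ttsMinutes * 0.0115,       // AWS Polly neural, approximate per minute
};

const total = cost.gateway + cost.stt + cost.bot + cost.tts;
const perMinute = total / totalMinutes;
```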
<h1 id="productreview">Product Review</h1>
<h2 id="setupmethodology">Setup &amp; Methodology</h2>
<p>To provide a fair comparison with my other reviews, I decided to stick with the existing Dialogflow bot I used for the <a href="https://cogint.ai/voximplant-dialogflow-connector-2019/">Voximplant</a> and <a href="https://cogint.ai/signalwire-dialogflow-2/">SignalWire</a> posts. To connect to the bot, I <a href="https://dialogflow.com/docs/reference/v2-auth-setup">exported the bot’s service account JSON</a> from Google Cloud and sent it to the team at AudioCodes.</p>
<p>To show how I could control my own connectivity, I made a SIP trunk using Twilio’s Elastic SIP Trunking service. AudioCodes gave me a SIP URI and I plugged this into the Origination setup inside Twilio:<br>
<img src="https://cogint.ai/content/images/2019/12/2---Twilio-Originiation-URI.png" alt="AudioCodes Voice.AI Gateway Review"><br>
Then I assigned a phone number to this trunk. I shared that phone number along with Twilio’s published list of IP addresses for the AudioCodes team to whitelist. The next day they had it ready and I was able to verify the bot connected.</p>
<p>Since the Voice.AI Gateway is offered as a managed service, I don’t have much other setup or code to share beyond the few tidbits inside Dialogflow below since AudioCodes took care of everything else.</p>
<h2 id="speechtotextstt">Speech to Text (STT)</h2>
<p>The Voice.AI Gateway performs its own transcription with the chosen transcription engine before passing along that transcribed text to the bot. My final tests used Google Cloud Speech-to-Text.</p>
<p>One of my intents is “Talk to Chad” to initiate a call transfer. In my testing, I noticed the transcription consistently would not capture my name properly. I never had this problem when Dialogflow did the transcription natively (vs. converting the speech to text first and sending it to Dialogflow that way, as was the case here).<br>
<img src="https://cogint.ai/content/images/2019/12/3---Talk-to-God.png" alt="AudioCodes Voice.AI Gateway Review"></p>
<p>It is hard to tell if perhaps the Google STT was not performing well or not configured optimally. Dialogflow does some automatic optimizations to its own STT to help better match vocal utterances to intents with its Auto Speech Adaptation feature:<br>
<img src="https://cogint.ai/content/images/2019/12/4---Dialogflow-auto-speech-adaptation.png" alt="AudioCodes Voice.AI Gateway Review"></p>
<p>AudioCodes did say they just added an option to define words to help the transcription engine better match the input for Google’s Speech-to-Text service, like this:</p>
<pre><code>{
    &quot;sttContextPhrases&quot;: [
            &quot;Chad&quot;
        ],
    &quot;sttContextBoost&quot;: 10
}
</code></pre>
<p>They say this could be included as an input parameter as I will cover in a bit.</p>
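<p>As a sketch of how that hint might be wired up from fulfillment, the helper below just assembles the JSON shown above. Whether the gateway expects it as a custom payload or a session parameter was not documented when I tested, so the wrapper itself is an assumption.</p>

```javascript
// Build the (assumed) speech-context hint for Google STT biasing, matching
// the sttContextPhrases / sttContextBoost structure AudioCodes described.
function sttHints(phrases, boost = 10) {
  return {
    sttContextPhrases: phrases, // words the recognizer should favor
    sttContextBoost: boost,     // relative weight of those hints
  };
}

// Bias recognition toward the transfer target's name:
const hintPayload = sttHints(['Chad']);
```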
<h2 id="texttospeechtts">Text to Speech (TTS)</h2>
<p>Since the Voice.AI gateway supports voice generation I was eager to try some different voices. I decided to go with AWS Polly’s <a href="https://d1.awsstatic.com/product-marketing/Polly/voices/features_emma_neural.be854134ed0f48f776e209d2def02b38d5e6afb7.mp3">Emma</a> for speech synthesis. Emma has an option for one of AWS’s higher-quality <em>neural</em> models. I also wanted to test synthesis in a different language variant than the bot was set to – <em>en-GB</em> in this case.</p>
<p>I verified <a href="https://en.wikipedia.org/wiki/Speech_Synthesis_Markup_Language">SSML</a> playback worked fine. Polly <a href="https://docs.aws.amazon.com/polly/latest/dg/managing-lexicons.html">Lexicons</a> for customizing pronunciation are not supported.<br>
<img src="https://cogint.ai/content/images/2019/12/5---SSML-response.png" alt="AudioCodes Voice.AI Gateway Review"></p>
<p>It would have been fun to experiment with <a href="https://speech.microsoft.com/customvoice">Azure’s Custom Voice </a>to make my own unique voice, but that was well beyond the scope of this review.</p>
<h2 id="calltransfer">Call Transfer</h2>
<p>The Voice.AI Gateway does not support Dialogflow’s Telephony Tab, but it is easy to initiate a call transfer with a <code>TransferCall</code> string followed by the number surrounded in brackets, as can be seen below.<br>
<img src="https://cogint.ai/content/images/2019/12/6--call-transfer.png" alt="AudioCodes Voice.AI Gateway Review"><br>
This actually has the advantage that you can set the bot to say something immediately before the transfer right from the GUI, without having to do anything fancy with follow-up intents or fulfillment.</p>
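<p>Based purely on the syntax described above (the literal <code>TransferCall</code> keyword plus a bracketed number), a response could be assembled like this. The exact spacing and bracket style are reconstructed from the screenshot, so verify against AudioCodes’ documentation before relying on it.</p>

```javascript
// Assemble a (hypothetical) transfer response: spoken text first, then the
// TransferCall keyword with the destination number in brackets.
function transferResponse(number, farewell = 'Hold on while I transfer you.') {
  return `${farewell} TransferCall [${number}]`;
}
```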
<h2 id="recording">Recording</h2>
<p>The Voice.AI gateway does not record directly itself, but it does support the <a href="https://tools.ietf.org/html/rfc7866">SIPREC</a> standard for sending recordings to third party recording systems. SIPREC is a SIP-based specification that essentially forks calls and sends the media to an external recording system. This approach is very common in large enterprises and call centers who often have sophisticated recording systems for doing analysis and storage for regulatory compliance. I did not verify this feature, but it is common in SBCs so I doubt there would be any issues.</p>
<h2 id="playbackinterruption">Playback Interruption</h2>
<p>The Voice.AI gateway does have barge-in capabilities. In testing, I had to tell the AudioCodes team how long a prompt should play by default before STT starts again, letting the user barge in.</p>
<h2 id="events">Events</h2>
<h3 id="welcome">Welcome</h3>
<p>When a new call comes in, the Voice.AI gateway sends a <code>VOICE_AI_WELCOME</code> event. That can be used like the default <code>WELCOME</code> event or Dialogflow’s Phone Gateway <code>TELEPHONY_WELCOME</code>. This event also includes a parameter with the calling phone number that can be accessed with <code>#VOICE_AI_WELCOME.caller</code>.<br>
<img src="https://cogint.ai/content/images/2019/12/7---VOICE_AI_WELCOME-event.png" alt="AudioCodes Voice.AI Gateway Review"><br>
Looking at the <a href="https://cloud.google.com/dialogflow/docs/history?authuser=1#webhook_errors_in_stackdriver_logs">logs in Stackdriver</a> shows the called number and SIP URL are also included, as shown in the <code>textPayload</code> object below:</p>
<pre><code>textPayload: &quot;Dialogflow Request : {&quot;session&quot;:&quot;95f08c1f-42ae-40f4-9ef0-f84e7c0c351b&quot;,&quot;query_input&quot;: {
    &quot;event&quot;: { 
        &quot;name&quot;: &quot;VOICE_AI_WELCOME&quot;,
            &quot;parameters&quot;: {
            &quot;calleeHost&quot;: &quot;voiceaisbcse.westeurope.cloudapp.azure.com&quot;,
            &quot;callee&quot;: &quot;+16172076328&quot;,
            &quot;callerHost&quot;: &quot;cogintai-audiocodes.pstn.twilio.com&quot;,
            &quot;caller&quot;: &quot;+16173145968&quot;
            }
        }
     },
     &quot;timezone&quot;:&quot;America/New_York&quot;}&quot;
</code></pre>
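<p>As a small illustration, the caller's number can be pulled straight out of that request shape. The function below assumes the exact structure logged above; inside the Dialogflow GUI you would reach the same value via <code>#VOICE_AI_WELCOME.caller</code>.</p>

```javascript
// Extract the calling number from a detectIntent request carrying the
// VOICE_AI_WELCOME event (structure taken from the Stackdriver log above).
function welcomeCaller(request) {
  const { name, parameters } = request.query_input.event;
  if (name !== 'VOICE_AI_WELCOME') return null; // some other event type
  return parameters.caller;
}
```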
<h3 id="dtmfdetection">DTMF Detection</h3>
<p>The Gateway will detect DTMF digit entry and send a <code>DTMF</code> event with a single DTMF digit in the <code>digits</code> parameter.</p>
<p>In the course of my discussions with AudioCodes, they mentioned they added some options to pass custom payload parameters to control various details. This includes setting the ASR engine to be continuous or to give it a timeout.</p>
<h2 id="controlparameters">Control Parameters</h2>
<p>I ran out of time to verify this and did not get any documentation, but the AudioCodes team said they added the following parameters that can be passed with the intent:</p>
<ul>
<li><code>speechLanguage</code></li>
<li><code>voiceName</code></li>
<li><code>continuousASR</code> – True/False, to always listen for spoken input</li>
<li><code>continuousASRTimeoutInMS</code> - to control barge in</li>
</ul>
<p>These would need to be set in code as part of fulfillment or passed as a custom payload via the GUI. One can see how this would enable switching of voices, languages, and control of how and when you should let users speak over your bot’s speech output.</p>
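<p>For example, a fulfillment response might attach something like the object below to switch to a British English voice with continuous recognition. The parameter names are the ones AudioCodes listed; the values and the way they would be wrapped into a custom payload are my assumptions.</p>

```javascript
// Hypothetical control-parameter payload for the Voice.AI Gateway.
const controlParams = {
  speechLanguage: 'en-GB',        // synthesis/recognition language variant
  voiceName: 'Emma',              // e.g. an AWS Polly voice
  continuousASR: true,            // keep listening for spoken input
  continuousASRTimeoutInMS: 1500, // assumed value controlling barge-in
};
```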
<h1 id="synopsis">Synopsis</h1>
<p>AudioCodes support for Dialogflow matured quite a bit in the couple of months I was interacting with them off-and-on for this post. It was apparent the system was originally architected around Azure. Azure Bot Service development is much more code-oriented vs. Dialogflow’s more typical GUI-centric method. Of course one can control a Dialogflow bot without the GUI, but this becomes one more thing that needs to be maintained without the benefit of the GUI’s guidance and tools. That is certainly not much work at all for an experienced bot developer but does mean more work for a Dialogflow novice who wants to keep things in the GUI as much as possible. Little things like hanging up when the call is done are still missing in the Voice.AI gateway, but I am sure they will come soon.</p>
<p>Beyond that, the ability to mix platforms is nice, but I wouldn’t recommend doing that if it can be avoided. Building a great voicebot is usually hard enough without having to worry about additional interaction dependencies. Perhaps there are some exceptions: if a specific synthesized voice has been chosen as a corporate brand and the IVR team decides to use a different underlying bot platform, or if a specific STT engine has already been tuned for one environment. A larger enterprise or call center may also have some premise-based transcription infrastructure in place they want to reuse. However, with few exceptions like Microsoft’s premise-based version of its bot framework, in most cases sensitive data will make its way to the cloud one way or another, so it is better to organize around modern cloud best practices. Most of the time companies choose one provider – Azure, AWS, or Google – and stick with them for everything. Voicebot development is certainly easier this way. Furthermore, the bot platforms themselves are doing more to optimize for speech input to reduce latency (unless you do STT on device) and improve performance. Dialogflow and Amazon Lex bots accept direct audio input. Dialogflow includes speech output too. It seems AudioCodes is moving in the direction of using bot-platform integrated speech services, for Dialogflow at least.</p>
<p>Where I see the Voice.AI gateway really shining is in more traditional call center environments. These often have a lot of premise-based equipment, perhaps with some enterprise-controlled elements in the cloud. This audience often would not have dedicated staff to manage an RTC-Bot gateway. They would prefer to pay to have a neck to wring if something breaks, so the managed service makes a lot of sense here. Similarly, recording with SIPREC is only practical if you have a SIPREC recorder already, but that is what that audience usually has already.</p>
<h2 id="scorecard">Scorecard</h2>
<p>The Voice.AI Gateway ended up being a reasonable fit against the criteria I outlined in <a href="https://cogint.ai/3-methods-for-connecting-a-phone-call-to-dialogflow/">prior posts</a>. Certainly, some of these requirements will matter more than others for your specific project.</p>
<table>
<thead>
<tr>
<th><strong>Requirement</strong></th>
<th><strong>Voice.AI Gateway support</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>Call Transfer</td>
<td>Yes – simple short code method</td>
</tr>
<tr>
<td>Recording</td>
<td>Yes, but only with external SIPREC systems</td>
</tr>
<tr>
<td>Playback interruption       </td>
<td>Yes – some options to set programmatically</td>
</tr>
<tr>
<td>No activity detection</td>
<td>No</td>
</tr>
<tr>
<td>DTMF detection</td>
<td>Yes</td>
</tr>
<tr>
<td>SMS</td>
<td>No</td>
</tr>
</tbody>
</table>
<p> </p>
<p>I have some other reviews planned in the near future. Make sure to <a href="https://cogint.ai/subscribe/">subscribe</a> so you don't miss anything!</p>
<p><em><strong>EDITS</strong></em><br>
<em>14-Jan, 2020: Updated pricing per AudioCodes guidance to include volume minimums without a managed service requirement in <a href="#pricing">Pricing</a></em></p>
<hr>
<h3 id="abouttheauthor">About the Author</h3>
<p>Chad Hart is an analyst and consultant with <a href="https://cwh.consulting">cwh.consulting</a>, a product management, marketing, and strategy advisory helping to advance the communications industry. In addition, recently he co-authored a study on <a href="https://krankygeek.com/research">AI in RTC</a> and helped to organize an <a href="https://krankygeek.com">event</a> / <a href="https://www.youtube.com/watch?v=P38cd3GLn74&amp;list=PL4_h-ulX5eNfaM0QM5r-PewWaY_zgLH7b">YouTube series</a> covering that topic.</p>
<hr>
<h3 id="remembertosubscribefornewpostnotificationsandfollowcogintai">Remember to <a href="https://cogint.ai/subscribe/">subscribe</a> for new post notifications and follow <a href="https://twitter.com/cogintai">@cogintai</a>.</h3>
</div>]]></content:encoded></item><item><title><![CDATA[Building a Multi-business Voicebot IVR]]></title><description><![CDATA[How to apply a Dialogflow agent to multiple different businesses with customization. The post covers how we arrived at our architecture, how we created a templating system for providing customized fulfillment with lots of node.js fulfillment code for reference.]]></description><link>https://cogint.ai/building-a-multi-business-voicebot-ivr/</link><guid isPermaLink="false">5d40e8c404a8c82a4f530e86</guid><category><![CDATA[voicebot]]></category><category><![CDATA[dialogflow]]></category><dc:creator><![CDATA[Chad Hart]]></dc:creator><pubDate>Wed, 31 Jul 2019 01:02:00 GMT</pubDate><media:content url="https://cogint.ai/content/images/2019/07/Depositphotos_12288583_s-2019.jpg" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://cogint.ai/content/images/2019/07/Depositphotos_12288583_s-2019.jpg" alt="Building a Multi-business Voicebot IVR"><p>In the previous 4 installments of our <a href="https://cogint.ai/building-a-voicebot-ivr-with-dialogflow/">Building a Voicebot IVR with Dialogflow</a> series, we validated we could make a decent phone-based voicebot interface for your average retail business. As we documented, setting up this system is no trivial matter. It is certainly not something a typical small business would do, so we set out on researching ways to make this technology more widely accessible. We devised a system for using Dialogflow to provide individual voicebots for different businesses using a templating approach.</p>
<p>In this post we will walk through how we arrived at our architecture, how we created a templating system for providing customized fulfillment, and we share a lot of our node.js code for reference. We’ll start by covering our <a href="#architecture">approach</a> and then dig into the <a href="#implementation">implementation</a>.</p>
<h1 id="architecture">Architecture</h1>
<h2 id="hierarchiesforhandlingmultiplebusinesses">Hierarchies for handling multiple businesses</h2>
<p>We wanted to allow a non-technical business owner to set up their own voicebot IVR in minutes. There are many different kinds of businesses with a near infinite number of things a caller could ask. To reduce the scope of possibilities, we started with a few key assumptions. First, we assumed businesses in a similar category would have similar intents - i.e. restaurants will get asked about “menus”. Second, we assumed businesses in a given sub-category will have very similar intents with individual fulfillment that is basically the same except for varying parameter values - i.e. the answer to “how big is your large pizza” will vary by pizza restaurant - say “10 inch” for Joe’s Pizza and “12 inch” for Alice’s Pizza. This means we could start with a few business categories and set up a reasonable number of generic intents for each. These intents could be grouped into templates by business or sub-business type and we could template the responses.</p>
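<p>The templating idea can be sketched in a few lines of node.js: one generic response per intent, with per-business parameter values substituted at answer time. The placeholder syntax here is illustrative, not our production format.</p>

```javascript
// Fill a generic intent response with a specific business's values.
function fillTemplate(template, values) {
  // Replace each {placeholder} with the business's value, leaving unknown
  // placeholders untouched so missing data is easy to spot.
  return template.replace(/\{(\w+)\}/g, (match, key) => values[key] ?? match);
}

const sizeAnswer = 'Our large pizza is {largePizzaSize}.';
fillTemplate(sizeAnswer, { largePizzaSize: '10 inches' }); // Joe's Pizza
fillTemplate(sizeAnswer, { largePizzaSize: '12 inches' }); // Alice's Pizza
```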
<p>We also wanted to make sure every business has its own unique phone number. As covered in the <a href="https://cogint.ai/building-a-voicebot-ivr-with-dialogflow/">previous posts</a>, we had a system for that, so we just needed to figure out how to tie that phone number to what appears to be a unique agent, applying the unique responses for each business at the right time. For that, we considered 2 options:</p>
<ol>
<li>One agent per category</li>
<li>One agent per business</li>
</ol>
<h3 id="oneagentpercategory">One agent per category</h3>
<p>Dialogflow is not really designed to handle multiple applications within the same bot, but as long as your intents are the same you can use different fulfillment to provide unique responses for each individual business. This approach would let us use one Dialogflow agent per business category - i.e. pizza shop, hairdresser, dentist, etc.</p>
<p><img src="https://cogint.ai/content/images/2019/07/dialogflow-architecture---one-agent-per-category.png" alt="Building a Multi-business Voicebot IVR"><br>
Using the <a href="https://cogint.ai/adding-sms-to-your-voicebot-ivr/#usingdialogflowcontextstomanageresponsesbychannel">contexts discussed in the previous post</a> with <a href="https://cogint.ai/voximplant-dialogflow-connector-2019/">VoxImplant’s VoxEngine API’s</a>, we were able to pass the called phone number as a unique identifier for each business through to Dialogflow’s fulfillment. Our fulfillment service would then use the phone number to look up the appropriate responses for that business.</p>
<h3 id="oneagentperbusiness">One Agent per Business</h3>
<p>The other approach is to use Dialogflow more how it was intended to be used - one agent per business, with each agent having its own intents and fulfillment. In this case, we do not need to rely so heavily on fulfillment, since much of each business's individual responses can be statically stored with the intent. Fulfillment still needs to be used in some cases where a static response is not practical, but this is far less than doing it for every intent like we had to in the <em>one agent per category</em> approach. Responses are faster when you can define them statically in the bot since you can skip the fulfillment step.</p>
<p><img src="https://cogint.ai/content/images/2019/07/dialogflow-architecture---one-agent-per-business.png" alt="Building a Multi-business Voicebot IVR"></p>
<h3 id="addingflexibilityatacost">Adding flexibility, at a cost</h3>
<p>We chose the one agent per business approach, but it wasn’t without some cost. The <em>one agent per category</em> approach works ok if you have a simple bot without much variation between businesses. However, we came up with many examples where we might want to have business-specific intents. While it is possible to use contexts to filter these on a per-business basis, this would lead to a bloated bot with many intents. The Dialogflow GUI is set up for the one agent per business model, so using that allows for easy tweaking without always having to communicate with a database in our back-end.</p>
<table style="border: none; border-collapse: collapse;">
<tbody>
<tr style="height: 0pt;">
<td style="vertical-align: top; padding: 5pt 5pt 5pt 5pt; border: solid #000000 1pt;">
<p style="line-height: 1.2; margin-top: 0pt; margin-bottom: 0pt;"><strong><span>Approach</span></strong></p>
</td>
<td style="vertical-align: top; padding: 5pt 5pt 5pt 5pt; border: solid #000000 1pt;">
<p style="line-height: 1.2; margin-top: 0pt; margin-bottom: 0pt;"><strong><span>Advantages</span></strong></p>
</td>
<td style="vertical-align: top; padding: 5pt 5pt 5pt 5pt; border: solid #000000 1pt;">
<p style="line-height: 1.2; margin-top: 0pt; margin-bottom: 0pt;"><strong><span>Disadvantages</span></strong></p>
</td>
</tr>
<tr style="height: 0pt;">
<td style="vertical-align: top; padding: 5pt 5pt 5pt 5pt; border: solid #000000 1pt;">
<p style="line-height: 1.2; margin-top: 0pt; margin-bottom: 0pt;"><span>One Agent per Template</span></p>
</td>
<td style="vertical-align: top; padding: 5pt 5pt 5pt 5pt; border: solid #000000 1pt;">
<ul style="margin-top: 0pt; margin-bottom: 0pt;">
<li><span>Simpler architecture</span></li>
<li><span>Standardized agents are easier to debug and update globally</span></li>
</ul>
</td>
<td style="vertical-align: top; padding: 5pt 5pt 5pt 5pt; border: solid #000000 1pt;">
<ul>
<li><span>Does not scale well</span></li>
<li><span>Difficult to customize agents</span></li>
<li><span>Dialogflow GUI not designed for this</span></li>
</ul>
</td>
</tr>
<tr style="height: 0pt;">
<td style="vertical-align: top; padding: 5pt 5pt 5pt 5pt; border: solid #000000 1pt;">
<p>One Agent per Business</p>
</td>
<td style="vertical-align: top; padding: 5pt 5pt 5pt 5pt; border: solid #000000 1pt;">
<ul style="margin-top: 0pt; margin-bottom: 0pt;">
<li><span>Fast responses since fulfillment is not always needed</span></li>
<li><span>Customizable intents</span></li>
<li><span>Easier to debug</span></li>
</ul>
</td>
<td style="vertical-align: top; padding: 5pt 5pt 5pt 5pt; border: solid #000000 1pt;">
<ul>
    <li><span><strike>Can't create agents programmatically</strike></span></li>
<li><span>Added layer of creating agents from templates</span></li>
<li><span>Costly to handle agent API keys</span></li>
</ul>
</td>
</tr>
</tbody>
</table>
<h2 id="programmaticallycreatingagents">Programmatically creating agents</h2>
<p>Since we chose to use one agent per business, we wanted a way to programmatically create the agent from a template. This was not possible when we originally started putting together this summary. It was only possible to create an agent manually through the Dialogflow web console. Our solution was just to manually create a bunch of Dialogflow agents and keep track of them as <code>available</code> or <code>not available</code> in our database.</p>
<p>Whenever you create a new Agent, a new project is created in the Google Cloud (GCP) account you used to authenticate against Dialogflow. Google forces you to create one project per agent, so there is no sharing of keys. This forced us to create extra logic to track the Dialogflow API keys for each agent. Manually copying over this key information was not pleasant 🙁.</p>
<p>Of course, <a href="https://cloud.google.com/dialogflow/docs/release-notes">Google added an API</a> to create an agent right after we finished our research. We did not try this, but it would appear to address much of the bot creation pains mentioned above.</p>
<h1 id="implementation">Implementation</h1>
<h2 id="staticvsdynamicresponses">Static vs. dynamic responses</h2>
<p>Imagine you are a business owner. If someone asks:</p>
<blockquote>
<p>When are you open?</p>
</blockquote>
<p>Without context, you could respond with an all-encompassing response like:</p>
<blockquote>
<p>“We are open from 5 PM to midnight, Monday to Thursday. On Friday and Saturday we open at 10 AM and close at 2 AM. On Sunday we are open from 10 AM to midnight”</p>
</blockquote>
<p>This is actually easy to fit into a static response, built into the intent. However, if someone asks something more specific like:</p>
<blockquote>
<p>How late are you open tomorrow night?</p>
</blockquote>
<p>You could respond with the long phrase above, but that’s a long response with way more information than was requested - not a great experience. Assuming it was Saturday, it would be better to respond with just:</p>
<blockquote>
<p>Tomorrow we are open until midnight</p>
</blockquote>
<p>In this case we can’t just use a static intent without sounding robotic. Fulfillment is needed here to:</p>
<ol>
<li>Figure out what day it was and add 1 to it for “tomorrow”</li>
<li>Look up the hours and return the appropriate response</li>
</ol>
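<p>At a high level, those two steps are only a few lines of code. Here is a minimal sketch (not the repo code; it assumes an <code>openHours</code> object keyed by weekday name, matching the data model shown later):</p>

```javascript
const DAYS = ['sunday', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday'];

// Answer "how late are you open tomorrow?" given today's date and the
// business's hours, e.g. { monday: { openHour: 17, closeHour: 24 }, ... }
const hoursForTomorrow = (today, openHours) => {
  const tomorrow = new Date(today);
  tomorrow.setDate(tomorrow.getDate() + 1);   // step 1: add 1 day for "tomorrow"
  const day = DAYS[tomorrow.getDay()];        // step 2: look up that day's hours
  const { openHour, closeHour } = openHours[day];
  return `Tomorrow we are open from ${openHour} to ${closeHour}`;
};
```

<p>The real implementation also has to handle the different parameter types Dialogflow can send and turn numeric hours into speakable text, as we will see shortly.</p>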
<p>It is easy to get carried away by getting more specific, so we sought a balance between intents that were too broad and too specific. In this example, we could simplify the intent to cover any question about “hours for tomorrow” and probably get away with broadening the response to:</p>
<blockquote>
<p>On Sunday we are open from 10 am to midnight</p>
</blockquote>
<p>Using a methodology like this we mapped intents to a set of fixed and dynamic responses. The next step was to stick this all in a database that could be used for our agent templates (for static responses) and fulfillment (for dynamic ones).</p>
<h2 id="datamodel">Data Model</h2>
<p>This is just an example that shows our journey in this topic. We used <a href="https://nodejs.org">node.js</a> with <a href="https://www.mongodb.com/">MongoDB</a> using <a href="https://mongoosejs.com/">mongoose</a>, <a href="https://lodash.com/">lodash</a> and <a href="https://expressjs.com/">Express</a>, but you could use any similar stack you prefer. Note that intermediate to advanced Node skills are assumed for most of the rest of this post, and we will skip a lot of detail to keep the focus on the Dialogflow specifics. You can see the full code in our <a href="https://github.com/emilianop11/voicebot-modules">repo</a>.</p>
<h3 id="businessinformation">Business Information</h3>
<p>Before we start digging into some code, let’s review the restaurant data models we set up. Let’s start with our <em>restaurant</em> model that keeps track of all the business-specific details. Below is the database entry:<br>
<img src="https://cogint.ai/content/images/2019/07/database---business-information.png" alt="Building a Multi-business Voicebot IVR"><br>
This also gives you an idea of our data structure. Pay special attention to the <code>agent</code> and <code>openHours</code> fields; we will be using those moving forward: <code>openHours</code> will be used as an example of how to respond back to Dialogflow’s fulfillment request, and <code>agent</code> tells us the Dialogflow agent that is tied to this business.</p>
<h3 id="dialogflowagenttracking">Dialogflow agent tracking</h3>
<p>This is an <code>agent</code> entry example:<br>
<img src="https://cogint.ai/content/images/2019/07/database---dialogflow-agent-tracking.png" alt="Building a Multi-business Voicebot IVR"></p>
<p>You can see here how we keep track of the agent credentials so we can individually update them (you should keep them in a separate config file, but we are trying to show everything as straightforwardly as possible), the project name, and whether the agent is already assigned to a business. When a new user finishes filling in the data for their business, you just need to go through this collection, find an available agent, mark it as <code>available</code> = <code>false</code>, and assign it to the corresponding restaurant document for the user (the <code>agent</code> field on the restaurant model).</p>
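<p>The assignment step itself is simple bookkeeping. A sketch of that logic with plain objects (the repo does the equivalent with mongoose queries against the agent collection):</p>

```javascript
// Pick the first unassigned agent from the pre-created pool, mark it taken,
// and link it to the restaurant document. Throws if the pool is exhausted.
const assignAgent = (agents, restaurant) => {
  const agent = agents.find(a => a.available);
  if (!agent) throw new Error('No Dialogflow agents available - create more in the console');
  agent.available = false;
  restaurant.agent = agent.name;
  return agent;
};
```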
<h3 id="agenttemplate">Agent template</h3>
<p>The last database entry we would like to show is what we called <code>agent template</code>. This document keeps the format of each answer we need to create in order to reply back to a fulfillment request. Let’s just focus on <code>hoursOfOperation</code> as we mentioned above:<br>
<img src="https://cogint.ai/content/images/2019/07/hours-of-operation-agent-template.png" alt="Building a Multi-business Voicebot IVR"></p>
<p>Ok, now that you have a rough idea of the contents of the DB, let's get to the Dialogflow setup and fulfillment code.</p>
<h2 id="dialogflowsetup">Dialogflow Setup</h2>
<p>First you need to get into Dialogflow and <a href="https://cogint.ai/dialogflow-phone-bot/#getanaccountcreateanagent">create an agent</a>.</p>
<h3 id="makeyourintent">Make your Intent</h3>
<p>Next, create a sample intent named <code>Hours Of Operation</code>.<br>
<img src="https://cogint.ai/content/images/2019/07/dialogflow-hours-of-operation-intent.png" alt="Building a Multi-business Voicebot IVR"></p>
<p>Make sure you properly set the entity type you will be getting out of each question in the <em>Action and parameters</em> section.</p>
<p>As shown above in the agent template model, each intent property has a nested structure. At the first child level you will see <code>date</code> and <code>date-time</code>. This corresponds to the parameter we get in the Dialogflow fulfillment request body according to how the user asked the question.</p>
<h3 id="intentjsonresponses">Intent JSON responses</h3>
<p>Our code needs to understand what exactly “now” or “tomorrow” means. If you set the parameters above, Dialogflow should automatically extract values for these. The JSON below shows how Dialogflow encodes “Are you open now?” into tangible information we can use to compute the answer:</p>
<pre><code class="language-javascript">{  
   &quot;responseId&quot;:&quot;9eb99604-0303-4d64-b757-8d4d125e11ce-2dd8e723&quot;,
   &quot;queryResult&quot;:{  
      &quot;queryText&quot;:&quot;are you open now?&quot;,
      &quot;parameters&quot;:{  
         &quot;date-time&quot;:{  
            &quot;date_time&quot;:&quot;2019-06-09T23:42:17-03:00&quot;
         },
         &quot;date&quot;:&quot;&quot;
      },
      &quot;allRequiredParamsPresent&quot;:true,
      &quot;intent&quot;:{  
         &quot;name&quot;:&quot;projects/my-demo-agent/agent/intents/30fe78ed-9d2d-4156-acfb-10c174db0417&quot;,
         &quot;displayName&quot;:&quot;Hours of operation&quot;
      }
   },
   &quot;session&quot;:&quot;projects/my-demo-agent/agent/sessions/f2698a2d-c8ca-1029-9dc0-c6a41a3a5352&quot;
}

</code></pre>
<p>Here we got a <code>date-time</code> parameter. If the question was “are you open tomorrow” we get a <code>date</code> parameter instead since no specific hour is provided:</p>
<pre><code class="language-javascript">{  
   &quot;responseId&quot;:&quot;2a1e9fdd-e5f4-4597-be68-73b7e16b1d29-2dd8e723&quot;,
   &quot;queryResult&quot;:{  
      &quot;queryText&quot;:&quot;are you open tomorrow?&quot;,
      &quot;parameters&quot;:{  
         &quot;date&quot;:&quot;2019-06-10T12:00:00-03:00&quot;,
         &quot;date-time&quot;:&quot;&quot;
      },
      &quot;allRequiredParamsPresent&quot;:true
   }
}
</code></pre>
<h3 id="fulfillment">Fulfillment</h3>
<p>Let’s see how we can set up an application that can provide an answer to each of these requests using the data we gathered from our customer. First of all, make sure you set up your intent to use fulfillment. As you can see from this image, we are tunneling our application using <em>ngrok</em>*, so Dialogflow can hit our HTTP server:<br>
<img src="https://cogint.ai/content/images/2019/07/dialogflow-fulfillment-webhook.png" alt="Building a Multi-business Voicebot IVR"><br>
*<em>If you've never heard of <a href="https://ngrok.com">ngrok</a>, it's a great tool for making your local web servers accessible over the Internet - you will definitely love it.</em></p>
<h2 id="fulfillmentcode">Fulfillment code</h2>
<p>For our example, let’s try to make our bot answer the question:</p>
<blockquote>
<p>Are you open tomorrow?</p>
</blockquote>
<p>This is the piece of code that handles the incoming request from Dialogflow.<br>
At a high level, we just build our answer and reply back.</p>
<pre><code class="language-javascript">const {WebhookClient} = require('dialogflow-fulfillment');

app.post('/fulfillment', (req, res) =&gt; {
  const agent = new WebhookClient({ request: req, response: res });

  buildIntentAnswer(agent)
  	.then(respondToFulfilment(agent));
});
</code></pre>
<pre><code class="language-javascript">const buildIntentAnswer = data =&gt; getFormattedData(data).then(getAnswer);

const respondToFulfilment = _.curry((agent, response) =&gt; {
	let intentMap = new Map();
  	intentMap.set(agent.intent, (agent) =&gt; agent.add(response));
  	return agent.handleRequest(intentMap);
});
</code></pre>
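<p>If the <code>_.curry</code> call looks unfamiliar: it lets <code>respondToFulfilment(agent)</code> return a new function that waits for the remaining <code>response</code> argument, which is exactly the shape <code>.then()</code> expects. A self-contained illustration of the same pattern without lodash (simplified - lodash's curry also accepts both arguments at once):</p>

```javascript
// A hand-rolled two-argument curry, mimicking what _.curry does above
const curry2 = fn => a => b => fn(a, b);

const respond = curry2((agent, response) => `${agent} says: ${response}`);

// Partially applied, like respondToFulfilment(agent) in the snippet above
const respondForAgent = respond('my-demo-agent');
```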
<p>If you want to know more about the <code>WebhookClient</code> and <code>handleRequest</code> functions, I strongly suggest you check out <a href="https://dialogflow.com/docs/reference/fulfillment-library/webhook-client">Dialogflow’s docs on those</a>.</p>
<p>The most important function of this chain is <code>buildIntentAnswer</code>, which we will cover in depth next.</p>
<h3 id="getformatteddatagatheringrestaurantinformation">getFormattedData - gathering restaurant information</h3>
<p>To get an answer for our question, the first thing we need to do is to gather some relevant business information. We need to know if the restaurant is actually open or closed at the moment the user asked. In order to do that, we need to query our database and find the restaurant hours of operation.</p>
<pre><code class="language-javascript">const getProjectName = data =&gt; data.session.split('/')[1];

const getFormattedData = (data) =&gt; {
	const agent = getProjectName(data);
	const agentType = 'restaurant';

	return new Promise((resolve, reject) =&gt; {
		getRelevantBusinessInformation(agent, agentType, data.intent).then((businessInfoForIntent) =&gt; {
			const fData = hoursOfOperation.getFormattedData(data, businessInfoForIntent);
			fData.agent = getProjectName(data);
			fData.agentType = agentType;
			resolve(fData);
		});
	});
}

const getRelevantBusinessInformation = (agent, agentType, intentName) =&gt; {
	return new Promise((resolve, reject) =&gt; {
		Agent.findOne({'name': agent}).then((agent) =&gt; {
			models[agentType].findOne({agent: agent._id}).then((restaurant) =&gt; {
				if (intentName === 'Hours of operation')
					return resolve(restaurant.operation.openHours.toObject());
				else
					throw new Error('Intent not yet supported');

			});
		});
	});
}
</code></pre>
<p><code>getFormattedData</code> takes care of obtaining that business information and shaping it into a form we can use to compute our answer.</p>
<p><code>getRelevantBusinessInformation</code> matches the agent in use (we got the name in the incoming request object) with the particular restaurant that is assigned to it. Once we have a matching restaurant, it just returns the relevant information for our intent, in this case <code>openHours</code>.</p>
<p>Once we have the <code>openHours</code> object, we need to use that information to create meaningful strings for the answer. A tangible example: the <code>hourToText</code> function translates an <code>openHour</code> value of <code>12</code> in the database into “twelve o clock”.</p>
<p>For that, we include an <code>hoursOfOperation</code> intent module, which encapsulates all of this logic. Keep in mind that each intent will need a completely different set of rules. These individual per-intent modules serve that purpose.</p>
<h5 id="hoursofoperationjs">hoursOfOperation.js</h5>
<pre><code class="language-javascript">const DAYS_OF_THE_WEEK = ['sunday', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday'];
const converter = require('number-to-words');

const hourToText = (hour) =&gt; {
	const strHour = hour.toString();
	const lastTwo = strHour.slice(-2);
	const remaining = strHour.substring(0, strHour.length - 2);
	const txtMins = lastTwo === '00' ? 'o clock' : converter.toWords(lastTwo);
	const txthours = converter.toWords(remaining);
	return `${txthours} ${txtMins}`
};

const dateHandler = (data, operationData) =&gt; {
	let date = data.parameters.date;
	date = new Date(date);
	const day = DAYS_OF_THE_WEEK[date.getDay()];
	const dayInfo = operationData[day];
	dayInfo.day = day;
	
	return {
		intent: data.intent,
		businessData: dayInfo,
		processedParams: {
			type: 'date',
			day
		},
		txtVars: {
			day,
			openHour: hourToText(dayInfo.openHour),
			closeHour: hourToText(dayInfo.closeHour)
		}
	};
}

const getHandler = (paramType) =&gt; {
	// dateTimeHandler, timeHandler and defaultHandler are omitted here for
	// brevity - see the full repo for their implementations
	const mapping = {
		'date': dateHandler,
		'date-time': dateTimeHandler,
		'time': timeHandler
	};
	return mapping[paramType] ? mapping[paramType] : defaultHandler;
}

function getRelevantParam(params) {
	for (let param in params) {
		if (params[param]) return param;
	}
	throw new Error('No relevant param found');
}

exports.getFormattedData = (data, operationData) =&gt; {
	const handler = getHandler(getRelevantParam(data.parameters));
	return handler(data, operationData);
};

exports.getState = (data) =&gt; {
	if (data.processedParams.type === 'date-time') {
		if (!data.businessData.open) return 'closed';

		const openHour = data.businessData.openHour;
		const closeHour = data.businessData.closeHour;
		const hour = data.processedParams.hour;

		//The restaurant closes before the day ends
		if (openHour &lt; closeHour) {
			if (hour &gt; openHour &amp;&amp; hour &lt; closeHour) return 'open';
			else return 'closed';
		} else { //The restaurant closes the day after
			if (hour &gt; openHour) return 'open';
			else return 'closed'
		}
	}

	if (data.processedParams.type === 'date') {
		return data.businessData.open ? 'open' : 'closed';
	}
};
</code></pre>
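<p>The open/closed comparison inside <code>getState</code> is worth isolating, because the past-midnight branch is easy to get wrong. A standalone sketch of that check (not taken from the repo - note it also counts the early-morning hours before <code>closeHour</code> as open, which a <code>hour &gt; openHour</code> check alone would miss):</p>

```javascript
// Is the business open at `hour`, given opening hours that may wrap past
// midnight? e.g. openHour 17, closeHour 2 means 5 PM until 2 AM the next day.
const isOpenAt = (hour, openHour, closeHour) => {
  if (openHour < closeHour) {
    // Closes the same day
    return hour > openHour && hour < closeHour;
  }
  // Closes the day after: open late evening OR the early hours after midnight
  return hour > openHour || hour < closeHour;
};
```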
<h3 id="getanswerbuildingtheanswer">getAnswer - building the answer</h3>
<p>We just finished gathering all the information we need; now we need to build the answer - the actual explicit sentence we want to play over the phone. We will make use of <code>getState</code> in our <code>hoursOfOperation</code> module, which will let us know if our restaurant is open or closed by comparing the information provided in the question with the restaurant’s actual opening hours. Note that this module contains logic to handle different types of parameters. We just included a <code>dateHandler</code> in this example, but you may need one for each type of parameter you could potentially get.</p>
<pre><code class="language-javascript">const getAnswer = (relevantData) =&gt; {
	return new Promise((resolve, reject) =&gt; {
		const answerVariant = computeAnswerVariant(relevantData);
		AgentTemplate.findOne({type: relevantData.agentType}).then((template) =&gt; {
			const possibleAnswers = template.getAnswers(relevantData.intent, answerVariant, relevantData.txtVars);
			const answer = _.sample(possibleAnswers);
			return resolve(answer);
		});
	});
}


const computeAnswerVariant = (data) =&gt; {
	return {
		state: hoursOfOperation.getState(data),
		paramType: data.processedParams.type
	};	
}
</code></pre>
<p>The template model gets all possible answers (if we don't want to repeat the same sentence over and over again, we provide different variations of the same answer). Lodash's <code>_.sample</code> method simply picks a random one from the array of answers.</p>
<p>This is a piece of the <code>AgentTemplate</code> mongoose schema that produces the text sentences:</p>
<pre><code class="language-javascript">const getTemplatePropertyFromIntentName = (intentName) =&gt; {
  const mapping = {
    'Hours of operation': 'HoursOfOperation',
    'Location': 'Location'
  };

  return mapping[intentName];
}

// Note: a regular function (not an arrow) so mongoose binds `this` to the document
agentTemplateSchema.methods.getAnswers = function (intentName, answerVariant, params) {
  const replaceVars = (text, params) =&gt; {
    for (let prop in params) text = text.replace(`{{${prop}}}`, params[prop]);
    return text;
  }

  const templateProp = getTemplatePropertyFromIntentName(intentName);

  let answers;

  try {
    answers = this.intents[templateProp][answerVariant.paramType][answerVariant.state];
  } catch(e) {
    answers = this.intents[templateProp][answerVariant];
  }
  
  return answers.map(ans =&gt; replaceVars(ans, params));
};

const AgentTemplate = mongoose.model('AgentTemplate', agentTemplateSchema);

module.exports = AgentTemplate;
</code></pre>
<p>As you can see, the <code>getAnswers</code> method in the model just accesses the proper child in the document (<code>date</code> in this case) and then replaces all the placeholders surrounded by curly braces with the proper values.</p>
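<p>One caveat with <code>replaceVars</code> as written: <code>String.replace</code> with a string pattern only swaps the first occurrence, so a template that mentions <code>{{day}}</code> twice would keep the second placeholder. A sketch of a variant that replaces every occurrence (split/join avoids regex-escaping concerns):</p>

```javascript
// Replace every {{prop}} occurrence in the template, not just the first
const replaceVarsAll = (text, params) => {
  for (const prop in params) {
    text = text.split(`{{${prop}}}`).join(params[prop]);
  }
  return text;
};
```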
<p>That covers our brief code review. We encourage you to clone <a href="https://github.com/emilianop11/voicebot-modules.git">the repo</a> and run it locally to see it working.</p>
<h2 id="everythingthathasabeginninghasanend">Everything that has a beginning has an end</h2>
<p>This is the last post in our <a href="https://cogint.ai/building-a-voicebot-ivr-with-dialogflow/">Building a Voicebot IVR with Dialogflow</a> series. Our plan to summarize our research in a single post early in May morphed into a series covering methods for <a href="https://cogint.ai/3-methods-for-connecting-a-phone-call-to-dialogflow/">adding phone connectivity to Dialogflow</a>, reviews of the Dialogflow connectors provided by <a href="https://cogint.ai/signalwire-dialogflow-2/">SignalWire</a> and <a href="https://cogint.ai/voximplant-dialogflow-connector-2019">VoxImplant</a>, how to <a href="https://cogint.ai/adding-sms-to-your-voicebot-ivr/">add SMS to your IVR</a> and this last one showing how we pulled it all together.  Next week Chad will be giving a presentation summarizing this series during his <em>Kill Your IVR with a Voicebot</em> talk at <a href="https://www.cluecon.com/">ClueCon</a> in Chicago on August 6th.</p>
<p>Now it’s your turn! We are always happy to share notes on projects and talk about our work. Make sure to leave some comments below so we can connect.</p>
<hr>
<h3 id="abouttheauthors">About the Authors</h3>
<p><em>Chad Hart</em> is an analyst and consultant with <a href="https://cwh.consulting">cwh.consulting</a>, a product management, marketing, and strategy advisory helping to advance the communications industry. In addition, he recently co-authored a study on <a href="https://krankygeek.com/research">AI in RTC</a> and helped to organize an <a href="https://krankygeek.com">event</a> / <a href="https://www.youtube.com/watch?v=P38cd3GLn74&amp;list=PL4_h-ulX5eNfaM0QM5r-PewWaY_zgLH7b">YouTube series</a> covering that topic.</p>
<p><em>Emiliano Pelliccioni</em> is a computer engineer working at webRTC.ventures and specializes in developing real time communication applications for clients around the globe. One of his projects includes developing bot integrations for a major CPaaS provider.</p>
<hr>
<h3 id="remembertosubscribefornewpostnotificationsandfollowcogintai">Remember to <a href="https://cogint.ai/subscribe/">subscribe</a> for new post notifications and follow <a href="https://twitter.com/cogintai">@cogintai</a>.</h3>
</div>]]></content:encoded></item><item><title><![CDATA[Adding SMS to your Voicebot IVR]]></title><description><![CDATA[How to combine telephony voice and SMS inside the same Dialogflow bot to make a multi-channel, intelligent IVR. Discussion of the use case, voicebot gateway architecture, and using contexts as filters.]]></description><link>https://cogint.ai/adding-sms-to-your-voicebot-ivr/</link><guid isPermaLink="false">5d366e9e04a8c82a4f530e6e</guid><category><![CDATA[voicebot]]></category><category><![CDATA[dialogflow]]></category><dc:creator><![CDATA[Chad Hart]]></dc:creator><pubDate>Tue, 23 Jul 2019 10:27:11 GMT</pubDate><media:content url="https://cogint.ai/content/images/2019/07/AdobeStock_239766332-smaller.jpeg" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://cogint.ai/content/images/2019/07/AdobeStock_239766332-smaller.jpeg" alt="Adding SMS to your Voicebot IVR"><p>Earlier in this <a href="https://cogint.ai/building-a-voicebot-ivr-with-dialogflow/">series on using Dialogflow as an Interactive Voice Response (IVR) replacement</a>, we established that SMS interaction is a nice-to-have feature. Users are used to texting, and oftentimes it is a better experience to just send the user a link. One might think that handling SMS with Dialogflow is simple, and it is, but many complexities come into play when you try to blend voice and SMS in the same bot. In this post we will show our findings on ways to implement this feature.</p>
<p>We will describe the use case some more, walk through the architecture we landed on for handling SMS and voice from the same phone number, and discuss using Dialogflow’s Contexts as a filter to select channels.</p>
<h2 id="usecasewhysmsandvoiceatthesametime">Use case: why SMS and Voice at the same time</h2>
<p>We used a restaurant as a reference business for our prototype. Imagine a caller on the move, looking for something to eat. They find the restaurant and want to know what kind of food they have. Maybe they do a web search and can’t find the menu, or maybe they just call in the first place - either because they have other questions or because they are not in a position to stare at a several inch piece of glass to see this info at the moment they were interested.</p>
<p>The first thing they ask is, “what is your menu?” Here the voicebot IVR could read off the menu, but that would take forever assuming there are more than a handful of menu options. The bot could also structure the menu into categories - say appetizers, salads, main courses, etc. - and let the caller select just the category, but that leads you back into a traditional IVR where you need to deal with navigating hierarchies. In a lot of cases it is just easier to ask the user if they would like a link to the menu they can read themselves at their leisure.</p>
<p>So the IVR could read off the menu URL. However, reading off a URL like <a href="https://myrandomlocalrestaurant.com/summermenu">https://myrandomlocalrestaurant.com/summermenu</a> on a call is never a great experience. After listening to what is often a long, un-punctuated string, the user needs to either remember the URL or write it down - which they may not be in a good position to do if their phone is held up to their ear. At some point they still need to enter the URL into a browser, hopefully without making a mistake. There are a lot of ways this interaction could go wrong, leading to a lost opportunity for the restaurant.</p>
<p>Wouldn’t it just be easier to send the caller something they can click on in the first place? In most cases the caller is calling from a mobile device that is SMS capable. You should be able to get their phone number from the incoming caller ID. Why not just send them the link via SMS?</p>
<p>Then, once you send them an SMS link, you could continue to engage the customer on that channel. This has the advantage of providing a second, asynchronous channel for on-going communication with the customer.</p>
<p>Now that we have established why a business would want to have a multi-channel bot that can handle both voice and SMS, next let’s look at how to implement that.</p>
<h2 id="architecturetelephonysmsindialogflow">Architecture: Telephony + SMS in Dialogflow</h2>
<p>What we really want from a Dialogflow developer perspective is something that looks like this:<br>
<img src="https://cogint.ai/content/images/2019/07/ideal-voice-SMS-bot-gateway.png" alt="Adding SMS to your Voicebot IVR"></p>
<h6 id="thisisthegatewayscenariowewerelookingforcouldntfindandhadtocreate">This is the gateway scenario we were looking for, couldn’t find, and had to create</h6>
<p>The gateway would handle the incoming interactions, tell if it was from the phone or SMS, and just send that on to Dialogflow.</p>
<p>If that ideal gateway were available <a href="https://cogint.ai/building-a-voicebot-ivr-with-dialogflow/">this blog series</a> probably wouldn’t exist. As we discussed earlier in the series, you can use <a href="https://cogint.ai/signalwire-dialogflow-2/#hackforsmssupport">Twilio or Signalwire to make a good SMS bot</a> but neither of those platforms fit our <a href="https://cogint.ai/3-methods-for-connecting-a-phone-call-to-dialogflow/#whatelseshouldthedialogflowgatewayhandle">gateway requirements</a> like <a href="https://cogint.ai/voximplant-dialogflow-connector-2019">VoxImplant</a>. So we could do something more like this where we effectively have a different gateway for each channel:<br>
<img src="https://cogint.ai/content/images/2019/07/dual-channel-voice-SMS-bot-gateway.png" alt="Adding SMS to your Voicebot IVR"></p>
<h6 id="wesplitthegatewayinto2channelsoneforsmsandoneforvoice">We split the Gateway into 2 channels - one for SMS and one for Voice</h6>
<p>A user could call in to the Telephony-CPaaS and when needed, you could have your bot controller app signal the SMS-CPaaS to send the message and continue as a SMS-bot as needed from there.</p>
<p>The problem with this approach is that each channel has its own phone number. This might be ok in some scenarios, but in most cases you want to minimize confusion for the customer and stick to a single number. If you want a single number then you need to have some kind of additional proxy - or just forward calls/messages from one CPaaS to the other.<br>
<img src="https://cogint.ai/content/images/2019/07/single-number-voice-sms-voicebot-gateway.png" alt="Adding SMS to your Voicebot IVR"></p>
<h6 id="toreusethesamephonenumberweendedupusingsignalwirecpaas1andforwardingvoicecallstovoximplantcpaas2">To reuse the same phone number, we ended up using SignalWire (CPaaS 1) and forwarding voice calls to VoxImplant (CPaaS 2).</h6>
<p>We had some SignalWire logic from another unrelated project for phone number selection that we wanted to reuse so we ended up using SignalWire numbers and forwarding calls to VoxImplant via SIP.  This setup involves a few clicks to enable inbound SIP inside VoxImplant and a 6-line script inside SignalWire:</p>
<pre><code>&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;
&lt;Response&gt;
    &lt;Dial record=&quot;record-from-answer-dual&quot;&gt;
        &lt;Sip&gt;sip:{{To}}@dialogflow-7sqrp2g.cogintai.voximplant.com&lt;/Sip&gt;
    &lt;/Dial&gt;
&lt;/Response&gt;
</code></pre>
<h6 id="signalwirelamlscriptusedtoforwardanincomingcalltovoximplantviasipthisalsorecordsthecall">SignalWire LAML script used to forward an incoming call to VoxImplant via SIP. This also records the call.</h6>
<h2 id="usingdialogflowcontextstomanageresponsesbychannel">Using Dialogflow contexts to manage responses by channel</h2>
<p>The next challenge we faced was delineating between incoming telephony and SMS messages at the bot.</p>
<h3 id="differentintentresponsesfordifferentchannels">Different Intent responses for different channels</h3>
<p>If the bot was very simple, you might be able to reuse the exact same intent responses across voice and text channels. We found that for many intents, it gives a better experience to differentiate the responses by channel. For example, if the user asks:</p>
<blockquote>
<p>How can I find you?</p>
</blockquote>
<p>The voicebot can respond with a verbal description:</p>
<blockquote>
<p>We are located at 4 Thompsons Point, building number 106 in Portland, Maine, just off the Fore River Parkway, Exit 5 or 5A on Route 295. Just a five-minute walk from the Portland Transportation Center, serviced by Amtrak and Concord Bus.</p>
</blockquote>
<p>While it makes more sense to just send an SMS user a map link they can use to navigate themselves:</p>
<blockquote>
<p>We are at 4 Thompsons Point #106, Portland, ME 04102. Here is a link to our address for driving or public transportation: <a href="https://goo.gl/maps/2Zc7hc6QdiSfHrjV7">https://goo.gl/maps/2Zc7hc6QdiSfHrjV7</a></p>
</blockquote>
<h3 id="contextsasintentfilters">Contexts as Intent filters</h3>
<p>Contexts let you store stateful information across intents during a user's bot interaction. In Dialogflow, you can <a href="https://cloud.google.com/dialogflow/docs/contexts-overview">set a context</a> at any time. Contexts can also store data, but in our case we mainly just needed the context as a filter.</p>
<p>As discussed in the <a href="https://cogint.ai/voximplant-dialogflow-connector-2019/">VoxImplant 2019 review</a>, we chose VoxImplant because they allow you to set contexts as part of their runtime scripting:</p>
<pre><code>phoneContext = {
  name: &quot;phone&quot;,
  lifespanCount: 99,
  parameters: {
    caller_id: call.callerid(),
    called_number: call.number()
  }
}

dialogflow.setQueryParameters({contexts: [phoneContext]})
</code></pre>
<p>We use the code above to tell our Dialogflow bot that the incoming call is a phone call. Then we simply duplicated the relevant intents and added a <em>phone</em> context to the duplicates:<br>
<img src="https://cogint.ai/content/images/2019/07/intent-with-context.png" alt="Adding SMS to your Voicebot IVR"></p>
<h3 id="permissions">Permissions</h3>
<p>How do you know if the user is on a mobile phone and can receive an SMS? There are a number of phone number lookup services where you can send the user’s caller ID and get a good idea of whether it is a mobile number. However, due to number porting these are not 100% reliable (at least in the US, where any number is technically SMS capable). We also decided it made sense to ask the user for permission while verifying they can receive texts. As a result, we ended up assigning four different context levels:</p>
<ol>
<li><em>Phone</em> - starting default for incoming telephone calls</li>
<li><em>SMS_capable</em> - for telephony callers we think are on a mobile phone, but we haven’t confirmed</li>
<li><em>SMS_authorized</em> - telephony callers that gave explicit confirmation they can receive text messages on their caller ID (or they gave another number)</li>
<li>No context - used for SMS interactions and as default</li>
</ol>
<p>Follow-up intents can be used to move a user from <em>SMS_capable</em> to <em>SMS_authorized</em>. We used fulfillment to actually send the text message.</p>
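<p>To make the channel check concrete, here is a minimal sketch of how a fulfillment handler could inspect those contexts. The function is our own illustration, not part of any Dialogflow SDK; the request shape follows Dialogflow’s v2 webhook format, where each context <code>name</code> arrives as a full session path:</p>

```javascript
// Sketch: pick a response channel from the contexts Dialogflow sends
// to the fulfillment webhook. The context names match the levels above.
function channelFromContexts(outputContexts) {
  // Names arrive as full paths ending in ".../contexts/sms_authorized",
  // so keep only the last path segment, case-insensitively.
  const names = (outputContexts || []).map(c => c.name.split('/').pop().toLowerCase());
  if (names.includes('sms_authorized')) return 'sms_authorized';
  if (names.includes('sms_capable')) return 'sms_capable';
  if (names.includes('phone')) return 'phone';
  return 'sms'; // no context - default / SMS interaction
}
```

<p>The returned label can then drive whether fulfillment replies with speech, offers to text the user, or actually sends the SMS.</p>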
<h3 id="thiscangetmessy">This can get messy</h3>
<p>Duplicating intents with several different contexts like this does clutter the Dialogflow GUI. If we had only two levels, perhaps we could have reused the <a href="https://cogint.ai/3-methods-for-connecting-a-phone-call-to-dialogflow/#method1usethedialogflowphonegateway">Telephony Tab</a> to manage a different set of responses within a single intent; we had four, and there is no way to make custom tabs today. Unfortunately, there is not really an alternative other than skipping the GUI and programming everything via API - a topic we will cover later in this series.</p>
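<p>If you do go the API route, the duplication itself is mechanical. Here is a hypothetical helper that stamps out per-channel copies of a base intent, each gated by an input context - a sketch only, with field names mirroring Dialogflow’s v2 Intent resource:</p>

```javascript
// Sketch: generate per-channel copies of a base intent definition,
// each gated by an input context, ready to push via the Dialogflow API.
function duplicateIntentForContexts(baseIntent, sessionPrefix, contextNames) {
  return contextNames.map(ctx => ({
    ...baseIntent,
    displayName: `${baseIntent.displayName} - ${ctx}`,
    inputContextNames: [`${sessionPrefix}/contexts/${ctx}`]
  }));
}
```

<p>Each generated object could then be passed to the Dialogflow API’s intent-creation call instead of duplicating intents by hand in the GUI.</p>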
<h2 id="whatsnext">What’s Next</h2>
<p>We are getting close to the end of our series. Next we will explore some of the difficulties in reusing a Dialogflow bot across multiple unique businesses. Also, <a href="https://cogint.ai/author/chad/">Chad</a> will be discussing many of these findings during his <em>Kill Your IVR with a Voicebot</em> talk at <a href="https://www.cluecon.com/">ClueCon</a> in Chicago on August 6th. Make sure to <a href="https://cogint.ai/subscribe/">subscribe</a> so you don’t miss our posts and leave your comments below.</p>
<hr>
<h3 id="abouttheauthors">About the Authors</h3>
<p><em>Chad Hart</em> is an analyst and consultant with <a href="https://cwh.consulting">cwh.consulting</a>, a product management, marketing, and strategy advisory helping to advance the communications industry. In addition, recently he co-authored a study on <a href="https://krankygeek.com/research">AI in RTC</a> and helped to organize an <a href="https://krankygeek.com">event</a> / <a href="https://www.youtube.com/watch?v=P38cd3GLn74&amp;list=PL4_h-ulX5eNfaM0QM5r-PewWaY_zgLH7b">YouTube series</a> covering that topic.</p>
<p><em>Emiliano Pelliccioni</em> is a computer engineer working at webRTC.ventures who specializes in developing real-time communication applications for clients around the globe. One of his projects includes developing bot integrations for a major CPaaS provider.</p>
<hr>
<h3 id="remembertosubscribefornewpostnotificationsandfollowcogintai">Remember to <a href="https://cogint.ai/subscribe/">subscribe</a> for new post notifications and follow <a href="https://twitter.com/cogintai">@cogintai</a>.</h3>
</div>]]></content:encoded></item><item><title><![CDATA[Voximplant's Dialogflow Connector: 2019 Review]]></title><description><![CDATA[A deep look at VoxImplant's Dialogflow Connector with some sample code showing how to use it to build an IVR replacement]]></description><link>https://cogint.ai/voximplant-dialogflow-connector-2019/</link><guid isPermaLink="false">5d1187ed04a8c82a4f530e5b</guid><category><![CDATA[review]]></category><category><![CDATA[voicebot]]></category><category><![CDATA[dialogflow]]></category><dc:creator><![CDATA[Chad Hart]]></dc:creator><pubDate>Tue, 25 Jun 2019 10:38:00 GMT</pubDate><media:content url="https://cogint.ai/content/images/2019/07/Depositphotos_12288222_xl-2015-small.jpeg" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://cogint.ai/content/images/2019/07/Depositphotos_12288222_xl-2015-small.jpeg" alt="Voximplant's Dialogflow Connector: 2019 Review"><p>In this third installment of our <a href="https://cogint.ai/building-a-voicebot-ivr-with-dialogflow">Building a Voicebot IVR with Dialogflow</a> series, we take a look at Voximplant’s Dialogflow Connector. Voximplant was the first CPaaS vendor to have a direct telephony integration with Dialogflow. Alexey Aylarov gave an introduction how they build this gateway <a href="https://cogint.ai/making-ivrs-not-suck-with-dialogflow-alexey-aylarov/">here</a> soon after it was first released. <a href="https://cogint.ai/author/emiliano/">Emiliano</a> and I ended up using Voximpant’s Dialogflow Connector for our project, so we will go much deeper into how Voximplant matched up to our requirements and provide some code samples in this post.</p>
<p><img src="https://cogint.ai/content/images/2019/06/VoxImplant-Dialogflow-Connector-architecture.png" alt="Voximplant's Dialogflow Connector: 2019 Review"></p>
<h1 id="voxengineapproach">VoxEngine Approach</h1>
<h2 id="serverlessexecution">Serverless Execution</h2>
<p>Unlike the XML-script based approach used by many platforms, Voximplant has a JavaScript-based execution environment it calls VoxEngine. While not terribly complex, it does require some JavaScript skills if you want to do more than copy and paste example code you may find. This all runs <a href="https://en.wikipedia.org/wiki/Serverless_computing">serverless</a>, so you don’t have to set up your own development environment or worry about servers. It is quick to get going if you just need to manage a few lines of code. The downside is you need to get used to their web-based Integrated Development Environment (IDE) and debugging tools, or figure out how to use their APIs to sync your JavaScript files. You will likely start to miss your preferred IDE as your VoxEngine scripts grow in complexity. Also, remember to save often since the web interface won’t do that for you automatically!</p>
<h2 id="dialogflowapiwrappers">Dialogflow API wrappers</h2>
<p>Voximplant provides a lot of control over its interaction with Dialogflow. It has effectively wrapped most of the Dialogflow APIs for interacting with an existing agent (but not the ones for programming an agent).</p>
<p>This API includes a <a href="https://voximplant.com/docs/references/voxengine/ai#createdialogflow">method to start a Dialogflow session</a> and a number of events and classes:</p>
<table style="width:100%">
    <tr><th>Events</th><th>Classes</th></tr>
    <tr><td><ul>
        <li>DialogflowError</li>
<li>DialogflowPlaybackFinished</li>
<li>DialogflowPlaybackMarkerReached</li>
<li>DialogflowPlaybackStarted</li>
<li>DialogflowResponse</li>
<li>DialogflowStopped</li>
        </ul></td><td><ul>
<li>DialogflowEventInput</li>
<li>DialogflowInstance</li>
<li>DialogflowOutputAudioConfig</li>
<li>DialogflowQueryInput</li>
<li>DialogflowQueryParameters</li>
<li>DialogflowResponse</li>
<li>DialogflowResult</li>
<li>DialogflowSettings</li>
<li>DialogflowStreamingRecognitionResult</li>
<li>DialogflowSynthesizeSpeechConfig</li>
<li>DialogflowTextInput</li>
<li>DialogflowVoiceSelectionParams</li>
        </ul></td></tr>
</table>
<br>
<p>I will review some of these events and classes in the <a href="#makingourvoicebot">making our voicebot</a> section.</p>
<h1 id="pricing">Pricing</h1>
<p>Voximplant charges $0.005 / minute for inbound traffic to VoxEngine and $0.015/minute for PSTN calls out of VoxEngine. This gives their Dialogflow Connector a 32% discount vs. the call forwarding approach on their same platform.</p>
<p>US phone numbers are $1.00/month.</p>
<p>As mentioned previously in this series, you can use the Dialogflow Phone Gateway for free, but I assume one would pay Google for one of its <a href="https://cloud.google.com/dialogflow/docs/editions">Enterprise plans</a> for an unlimited quota with an SLA.</p>
<p><img src="https://cogint.ai/content/images/2019/06/VoxImplant-Dialogflow-pricing-comparison-1.jpg" alt="Voximplant's Dialogflow Connector: 2019 Review"></p>
<h1 id="makingourvoicebot">Making our voicebot</h1>
<p>The first post in this series describes <a href="https://cogint.ai/3-methods-for-connecting-a-phone-call-to-dialogflow/#whatelseshouldthedialogflowgatewayhandle">requirements</a> for making a voicebot gateway to replace an IVR. In this section I will give some highlights of Voximplant’s setup and review how to implement some of the requirements, with code samples where appropriate.</p>
<p><strong>You can see my entire VoxEngine JavaScript program in <a href="https://gist.github.com/chadwallacehart/c5071cad01aa670578c2c263c311d83c">this gist</a></strong>.</p>
<p>Note the samples below are not intended to be a step-by-step development guide - refer to Voximplant’s documentation for that.</p>
<h2 id="setupviamarketplace">Setup via Marketplace</h2>
<p>In addition to the reference guide <a href="https://voximplant.com/blog/how-to-use-dialogflow-connector">here</a> for setting up the Dialogflow Connector, Voximplant also added an integration option inside their Marketplace to make this easy. You can find it off of their main side menu.</p>
<p><img src="https://cogint.ai/content/images/2019/06/VoxImplant-Dialogflow-Connector-wizard.png" alt="Voximplant's Dialogflow Connector: 2019 Review"></p>
<p>From there you will need to go into the Google Cloud Console to get your service account key. Follow the <a href="https://dialogflow.com/docs/reference/v2-auth-setup">setting up authentication</a> guide from the Dialogflow docs to get that. Once you get that and upload it, Voximplant will create a new application that includes the agent connection. You should see the default Dialogflow Connector code show up in your scenarios list:</p>
<p><img src="https://cogint.ai/content/images/2019/06/VoxImplant-scenario.png" alt="Voximplant's Dialogflow Connector: 2019 Review"><br>
You can also access, add, and delete an agent in the Dialogflow Connector menu inside your application. If you add a Dialogflow agent this way it will not automatically create the Scenario code.<br>
<img src="https://cogint.ai/content/images/2019/06/VoxImplant-Dialogflow-Connector-page.png" alt="Voximplant's Dialogflow Connector: 2019 Review"></p>
<p>Voximplant’s <a href="https://voximplant.com/blog/how-to-use-dialogflow-connector">guide</a> mentions this, but don’t forget to set your Text-to-Speech configuration in Dialogflow to <em>MP3</em> or <em>Ogg Opus</em>. I forgot this and was wondering why I was only hearing silence.</p>
<p>The code itself is documented so I will give some highlights below.</p>
<h2 id="dialogfloweventsandcontexts">Dialogflow Events and Contexts</h2>
<h3 id="sendingevents">Sending Events</h3>
<p>In addition to handling text and speech-based inputs to trigger an <em>intent</em>, Dialogflow can also handle incoming events. The advantage of events vs. sending utterances is that there is no machine learning processing of the utterance. Using an event allows you to immediately and definitively invoke an intent.</p>
<p>Voximplant uses this event-based system when it sends an initial query, after connecting:</p>
<pre><code>// Sending WELCOME event to let the agent say a welcome message
dialogflow.sendQuery({event : {name: &quot;WELCOME&quot;, language_code: &quot;en&quot;}})
</code></pre>
<h3 id="settingcontexts">Setting Contexts</h3>
<p>In addition, Dialogflow also lets you set <a href="https://cloud.google.com/dialogflow/docs/contexts-overview">contexts</a>. Contexts let you store stateful information across intents during a user's bot interaction. Contexts can also be used as filters when you might want to have different responses for the same intent in different scenarios - like responding differently on a voice call vs. a text message. In our voicebot we wanted to have a <code>phone</code> context that kept the user’s caller ID and phone number. VoxEngine has a <code>dialogflow.setQueryParameters</code> method to do this that you should set before sending a query. To do this we just added the following instead of the above code:</p>
<pre><code>// Set a phone context with phone parameters
// Note: Dialogflow seems to convert camelCase to camel_case, so I just changed it here
// ToDo: error handling if returned parameters are null?
phoneContext = {
  name: &quot;phone&quot;,
  lifespanCount: 99,
  parameters: {
    caller_id: call.callerid(),
    called_number: call.number()
  }
}

dialogflow.setQueryParameters({contexts: [phoneContext]})

// Sending WELCOME event to let the agent say a welcome message
dialogflow.sendQuery({event : {name: &quot;WELCOME&quot;, language_code:&quot;en&quot;}})

</code></pre>
<h2 id="calltransfer">Call Transfer</h2>
<p>The bottom part of Voximplant’s code gives examples for handling telephony responses from Dialogflow including call transfer, speech synthesis, and playback of an audio file - all the options available in Dialogflow’s Telephony menu. Basically all you need to do is set your Telephony options there, just like you would do if you were using Dialogflow’s Phone Gateway.</p>
<p><img src="https://cogint.ai/content/images/2019/06/signalwire-dialogflow-telephony-fail.png/resize?w=800" alt="Voximplant's Dialogflow Connector: 2019 Review"></p>
<h6 id="dialogflowsbuiltintelephonyoptionsallworkwithvoximplant">Dialogflow’s built-in Telephony options all work with Voximplant</h6>
<p>Just uncomment the transfer code and enter one of your phone numbers to dial from:</p>
<pre><code>function processTelephonyMessage(msg) {
  // Transfer call to msg.telephonyTransferCall.phoneNumber
  if (msg.telephonyTransferCall !== undefined) {
    dialogflow.stop()
    let newcall = VoxEngine.callPSTN(msg.telephonyTransferCall.phoneNumber, VOICEBOT_PHONE_NUMBER)
    VoxEngine.easyProcess(call, newcall)
  }
}
</code></pre>
<h2 id="recording">Recording</h2>
<p>Call recording is trivial to do. Just add <code>call.record({stereo: true})</code>. Stereo is not mandatory, but it makes debugging easier since the caller and the Dialogflow agent are on separate audio channels. I placed this in the <code>onCallRecorded</code> function.</p>
<p>Voximplant puts the call recording in the call history tab. That is all I needed since I just wanted the recordings for debugging. In a real application you could share the recording URL and save it somewhere else.<br>
<img src="https://cogint.ai/content/images/2019/06/VoxImplant-call-history.png" alt="Voximplant's Dialogflow Connector: 2019 Review"></p>
<h6 id="examplefromvoximplantscallhistoryshowingchargeschargesandrecordingplayback">Example from Voximplant’s call history showing charges and recording playback</h6>
<h2 id="playbackinterruption">Playback Interruption</h2>
<p>We want our voicebot to be as realistic as possible. One aspect of human-to-human conversations is that we interrupt each other. Ideally the voicebot could speak and listen at the same time and even interrupt its own responses if needed. The way Dialogflow’s query API works, you send it an utterance and it returns a response. While sending a bunch of queries back to back is fine, it does not make sense to play back multiple responses over the top of each other. To avoid this, a lot of voicebots take a sequential approach:</p>
<ol>
<li>Stream audio to Dialogflow, which listens for an utterance</li>
<li>When Dialogflow hears an utterance it sends a response to the gateway</li>
<li>The gateway then stops listening for new audio</li>
<li>The gateway plays back the audio provided by Dialogflow (or synthesizes speech from the returned text)</li>
<li>Then the gateway starts listening again, starting the cycle over</li>
</ol>
<p>Voximplant does not have a perfect solution for playback interruption, but they do have a playback marker concept that lets you tell the gateway when to start listening again after playback. Voximplant actually lets you set this marker to a negative value, so the gateway starts listening again before playback of the previous intent response is done. If you get this timer right you can allow some ability to interrupt without worrying about overlapping playback.</p>
<p>Voximplant has a single line of code for this:</p>
<pre><code>// Playback marker used for better user experience
dialogflow.addMarker(-300)
</code></pre>
<p>It may be possible to continuously listen for user utterances and use <code>dialogflow.stopMediaTo(call)</code> to stop playback if a new intent comes in while the previous intent response is still playing. The logic to get this right seems complex, and the application itself might want to choose when it is worth interrupting playback. I did not experiment with this yet.</p>
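<p>If you wanted to experiment, the bookkeeping might look something like the sketch below. This is our own state tracking, not a VoxEngine API; in a real script, <code>dialogflow.stopMediaTo(call)</code> would be called wherever <code>shouldInterrupt</code> returns true:</p>

```javascript
// Sketch: decide whether to cut off the current playback when a new
// Dialogflow response arrives mid-playback.
class PlaybackGate {
  constructor() { this.playing = false; }
  onPlaybackStarted() { this.playing = true; }   // wire to DialogflowPlaybackStarted
  onPlaybackFinished() { this.playing = false; } // wire to DialogflowPlaybackFinished
  // Only interrupt for a final result; streaming partial results should
  // not stop audio that is already going out.
  shouldInterrupt(isFinalResult) {
    if (!this.playing) return false;
    return Boolean(isFinalResult);
  }
}
```

<p>An application would likely extend the decision with intent priorities so that, say, a call-transfer response interrupts playback but small talk does not.</p>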
<h2 id="noactivitydetection">No Activity Detection</h2>
<p>If you were in the middle of a call with another person and they suddenly went silent, you would say “are you there?” The voicebot needs a mechanism to do something similar. Google Assistant provides an <code>actions_intent_NO_INPUT</code> event to Dialogflow to handle this. We can create something similar inside VoxEngine.</p>
<p>First we will need to create a timer to keep track of silence time. I created a <code>Timer</code> class for this:</p>
<pre><code>const MAX_NO_INPUT_TIME = 15000
let dialogflow, call, hangup

// Fire the &quot;NO_INPUT&quot; event if there is no speech for MAX_NO_INPUT_TIME
class Timer {
  constructor() {
    this.expired = false
    this.noInputTimer = null
  }
  start() {
    this.noInputTimer = setTimeout(() =&gt; {
      this.expired = true
      Logger.write(&quot;No_Input timer exceeded&quot;)
      dialogflow.sendQuery({event: {name: &quot;NO_INPUT&quot;, language_code: &quot;en&quot;}})
    }, MAX_NO_INPUT_TIME)
    Logger.write(&quot;No_Input timer started&quot;)
  }
  stop() {
    this.expired = false
    clearTimeout(this.noInputTimer)
    Logger.write(&quot;No_Input cleared&quot;)
  }
}

let timer = new Timer()
</code></pre>
<p>You will notice my class has a start function that uses a simple <code>setTimeout</code> and will send a <code>NO_INPUT</code> event to the Dialogflow agent if it expires.</p>
<p>Then we just start this timer whenever a Dialogflow audio response finishes playing. If the timer runs out, the <code>NO_INPUT</code> event is sent.</p>
<pre><code>   dialogflow.addEventListener(AI.Events.DialogflowPlaybackFinished, (e) =&gt; {
     timer.start()
   })
   dialogflow.addEventListener(AI.Events.DialogflowPlaybackStarted, (e) =&gt; {
     // Dialogflow TTS playback started
     timer.stop()
   })
</code></pre>
<p>I thought this would be all I needed, but when I first tried it, it didn’t work. It turns out calling <code>sendMediaTo</code> right after <code>sendQuery</code> overrides the stream. To prevent this race condition, I had to add another check to the <code>onDialogflowResponse</code> function to not send media to Dialogflow if the timer has expired. We want the <code>NO_INPUT</code> response to come back and play first. It will not work without this.</p>
<pre><code>// Handle Dialogflow responses
function onDialogflowResponse(e) {
 // If DialogflowResponse with queryResult received - the call stops sending media to Dialogflow
 // in case of response with queryResult but without responseId we can continue sending media to dialogflow
 if (e.response.queryResult !== undefined &amp;&amp; e.response.responseId === undefined) {
   if (!timer.expired)
     call.sendMediaTo(dialogflow)
 }
 // ... rest of the response handling
}
</code></pre>
<p>On the Dialogflow side, make sure to make an intent and populate the <code>NO_INPUT</code> event with some default responses.<br>
<img src="https://cogint.ai/content/images/2019/06/Dialogflow-No-Input-intent.png/resize?w=800" alt="Voximplant's Dialogflow Connector: 2019 Review"></p>
<h2 id="dtmfdetection">DTMF detection</h2>
<p><a href="https://en.wikipedia.org/wiki/Dual-tone_multi-frequency_signaling">DTMF</a> entry is a nice-to-have feature in some circumstances, even if the overall goal is to eliminate the traditional “select 1 for…” approach. For example, if you ask for a phone number, a user may prefer to touchtone it rather than say it. In other cases one might need to maintain some existing IVR functionality for scripts that auto-navigate through menu options.</p>
<p>Voximplant’s <code>call</code> class has the ability to alert on DTMF tones:</p>
<pre><code>let waitForTone = false // global we’ll need later

// ...
// This is part of the VoxEngine.addEventListener block
// ...

 call.handleTones(true)
 call.addEventListener(CallEvents.ToneReceived, onTone)
</code></pre>
<p>Our <code>onTone</code> function then just needs to send an event to Dialogflow. We will add a parameter to pass the digit:</p>
<pre><code>function onTone(e){
 //ToDo: error handling - i.e. check to make sure Dialogflow is connected and running first, tone values
 waitForTone = true
 dialogflow.sendQuery({event : {name: &quot;DTMF&quot;, language_code: &quot;en&quot;, parameters: { dtmf_digits: e.tone} }})
}
</code></pre>
<p>In a real app you likely would want to capture multiple digits if they are entered in sequence and send them to Dialogflow as a single event instead of a series of events.</p>
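<p>A sketch of that batching logic in plain JavaScript - in VoxEngine, the flush callback would be where <code>dialogflow.sendQuery</code> sends the combined <code>DTMF</code> event. The class and the two-second gap are our own illustration:</p>

```javascript
// Sketch: collect digits entered in quick succession and hand them
// off as one string once the caller pauses.
class DigitBuffer {
  constructor(onFlush, gapMs = 2000) {
    this.digits = '';
    this.onFlush = onFlush;
    this.gapMs = gapMs;
    this.timer = null;
  }
  push(digit) {
    // Each new tone extends the inactivity window
    this.digits += digit;
    clearTimeout(this.timer);
    this.timer = setTimeout(() => this.flush(), this.gapMs);
  }
  flush() {
    clearTimeout(this.timer);
    const collected = this.digits;
    this.digits = '';
    if (collected) this.onFlush(collected);
    return collected;
  }
}
```

<p>Each received tone resets the timer, so a caller typing a full phone number produces one event instead of ten.</p>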
<p>Like with the No Activity Detection <a href="#noactivitydetection">above</a>, we need to pause sending media to Dialogflow while we handle our event response:</p>
<pre><code>function onDialogflowResponse(e) {
 // If DialogflowResponse with queryResult received - the call stops sending media to Dialogflow
 // in case of response with queryResult but without responseId we can continue sending media to dialogflow
 if (e.response.queryResult !== undefined &amp;&amp; e.response.responseId === undefined) {
       if (!timer.expired &amp;&amp; !waitForTone)
          call.sendMediaTo(dialogflow)
 }
 // ... rest of the response handling
}
</code></pre>
<p>Lastly, remember to reset our global <code>waitForTone</code> to resume sending media to Dialogflow:</p>
<pre><code>   dialogflow.addEventListener(AI.Events.DialogflowPlaybackFinished, (e) =&gt; {
     timer.start()
     waitForTone = false
   })
</code></pre>
<p>Like with the No Activity Detection, we need to set up our event in Dialogflow. I created an <a href="https://dialogflow.com/docs/entities">Entity</a> to handle the parameters passed by the VoxEngine script. This one also needs some training phrases.<br>
<img src="https://cogint.ai/content/images/2019/06/Dialogflow-DTMF-intent.png/resize?w=800" alt="Voximplant's Dialogflow Connector: 2019 Review"></p>
<h2 id="nosmsintegrationwithdialogflow">No SMS Integration with Dialogflow</h2>
<p>Voximplant does not offer any VoxEngine control over SMS, which means VoxEngine has no way to send incoming SMS messages to our bot. Like with other platforms, you could always set up your own server and use webhooks to handle this interaction, but we were looking for something simpler.</p>
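<p>For reference, if you did build that helper server, its core would just map the inbound webhook fields onto a Dialogflow text query. Here is a sketch of that mapping - the SMS field names are hypothetical and would need to match whatever your provider actually posts, while the request shape follows Dialogflow’s v2 detectIntent body:</p>

```javascript
// Sketch: turn an inbound SMS webhook payload into a Dialogflow v2
// detectIntent request body. The sms.from / sms.body fields are
// hypothetical; adapt them to your provider's webhook format.
function smsToDetectIntentRequest(sms, projectId) {
  return {
    // One Dialogflow session per phone number keeps context per user
    session: `projects/${projectId}/agent/sessions/sms-${sms.from}`,
    queryInput: {
      text: { text: sms.body, languageCode: 'en' }
    }
  };
}
```
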
<h2 id="scorecard">Scorecard</h2>
<p>Voximplant ended up being a good (but not perfect) fit against the criteria we identified in the <a href="https://cogint.ai/3-methods-for-connecting-a-phone-call-to-dialogflow/">first post of this series</a>:</p>
<table>
<thead>
<tr>
<th><strong>Requirement</strong></th>
<th><strong>Voximplant Dialogflow Connector</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>Call Transfer</td>
<td>Yes</td>
</tr>
<tr>
<td>Recording</td>
<td>Yes</td>
</tr>
<tr>
<td>Playback Interruption</td>
<td>Some - playback marker gives some of the effect</td>
</tr>
<tr>
<td>No activity detection</td>
<td>Yes</td>
</tr>
<tr>
<td>DTMF detection</td>
<td>Yes</td>
</tr>
<tr>
<td>SMS support</td>
<td>No</td>
</tr>
</tbody>
</table>
<h2 id="whatsnext">What’s Next</h2>
<p>Now that we have established our requirements and evaluated a few implementation options, in the last few posts in this series we will start to get into architectural considerations and actual implementation challenges. You can leave comments below and remember to <a href="https://cogint.ai/subscribe/">subscribe</a> if you want to see more.</p>
<hr>
<h3 id="abouttheauthor">About the Author</h3>
<p>Chad Hart is an analyst and consultant with <a href="https://cwh.consulting">cwh.consulting</a>, a product management, marketing, and strategy advisory helping to advance the communications industry. In addition, recently he co-authored a study on <a href="https://krankygeek.com/research">AI in RTC</a> and helped to organize an <a href="https://krankygeek.com">event</a> / <a href="https://www.youtube.com/watch?v=P38cd3GLn74&amp;list=PL4_h-ulX5eNfaM0QM5r-PewWaY_zgLH7b">YouTube series</a> covering that topic.</p>
<hr>
<h3 id="remembertosubscribefornewpostnotificationsandfollowcogintai">Remember to <a href="https://cogint.ai/subscribe/">subscribe</a> for new post notifications and follow <a href="https://twitter.com/cogintai">@cogintai</a>.</h3>
</div>]]></content:encoded></item><item><title><![CDATA[SignalWire’s Dialogflow Connector: 2019 review]]></title><description><![CDATA[Updated review of SignalWire's Dialogflow Connector for connecting phone calls into Dialogflow to add voicebot functionality and replace IVRs. Includes a trick for SMS bots.]]></description><link>https://cogint.ai/signalwire-dialogflow-2/</link><guid isPermaLink="false">5d08478304a8c82a4f530e4a</guid><category><![CDATA[voicebot]]></category><category><![CDATA[dialogflow]]></category><dc:creator><![CDATA[Chad Hart]]></dc:creator><pubDate>Mon, 17 Jun 2019 10:55:00 GMT</pubDate><media:content url="https://cogint.ai/content/images/2019/08/AdobeStock_63025043-small.jpeg" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://cogint.ai/content/images/2019/08/AdobeStock_63025043-small.jpeg" alt="SignalWire’s Dialogflow Connector: 2019 review"><p>Last week we talked about the <a href="https://cogint.ai/3-methods-for-connecting-a-phone-call-to-dialogflow/">3 Methods for Connecting a Phone call to Dialogflow</a>. The third method we discussed was direct connectivity to Dialogflow from your telephony platform. This approach provides the best call quality possible and should cost less since you don’t need to pay for call charges out of the telephony environment or into the Dialogflow Phone Gateway. The downsides are there are few options for doing this and each platform is proprietary. SignalWire is one of the Communications Platform as a Service (CPaaS) who has been doing this for a while and that is what I will review today.</p>
<p>SignalWire introduced a basic Dialogflow connector when they launched last year. In my <a href="https://cogint.ai/calling-into-dialogflow-with-signalwire/">initial review of that</a>, I discussed some bugs and noted that the main advantage of SignalWire’s implementation was free connectivity into Dialogflow. The updates I noticed this time were:</p>
<ul>
<li>New connectivity charge per minute to connect to Dialogflow</li>
<li>More TTS voices for the welcome prompt</li>
<li>The start to some Dialogflow telephony integration</li>
<li>Support for Dialogflow agents set to languages other than English</li>
</ul>
<p><img src="https://cogint.ai/content/images/2019/06/signalwire-dialogflow-connector-architecture.png/resize?w=1200" alt="SignalWire’s Dialogflow Connector: 2019 review"></p>
<h2 id="pricing">Pricing</h2>
<p>SignalWire Dialogflow connectivity is now priced at 2 cents per minute. This is quite a bit higher than their 0.65 cents per minute for placing an outbound call. However, it is still a deal compared to the 5 cents per minute Google charges on their Enterprise plan, on top of the 0.65 cents per minute you would then need to pay SignalWire for the outbound call if you used the CPaaS + Dialogflow Phone Gateway approach described earlier.</p>
<p><img src="https://cogint.ai/content/images/2019/06/signalwire-dialogflow-connector-pricing.png" alt="SignalWire’s Dialogflow Connector: 2019 review"></p>
<h6 id="examplefromsignalwiresusagehistoryshowingdialogflowconnectorcharges">Example from SignalWire’s usage history showing Dialogflow Connector charges</h6>
<p><img src="https://cogint.ai/content/images/2019/06/signalwire-dialogflow-price-comparison.png/resize?w=1200" alt="SignalWire’s Dialogflow Connector: 2019 review"></p>
<h6 id="acostcomparisonofthecallforwardingapproachvsusingsignalwiresdialogflowconnectortheconnectorreducesthecostbyneary60">A cost comparison of the call forwarding approach vs. using SignalWire’s Dialogflow connector. The connector reduces the cost by nearly 60%.</h6>
<h2 id="dialogflowtelephonyinteraction">Dialogflow Telephony interaction</h2>
<p>Like before, SignalWire has the ability to play a welcome message prior to connecting to Dialogflow. They offer the complete list of Standard and WaveNet voices from Google’s Text-to-Speech (TTS) to choose from as a default TTS engine, but this selection only applies to the welcome prompt. The agent voice was still whatever was set inside Dialogflow. I also noticed that the “Synthesize voice” option inside Dialogflow’s telephony integration worked, but it too used the Dialogflow voice selection, not the SignalWire one. I also tried the Dialogflow Telephony integration’s Transfer Call feature, but that did not seem to do anything when I tested it.</p>
<p><img src="https://cogint.ai/content/images/2019/06/signalwire-dialogflow-telephony-fail.png" alt="SignalWire’s Dialogflow Connector: 2019 review"></p>
<h6 id="dialogflowsbuiltintransfercallfeaturedidnotwork">Dialogflow’s built-in <em>Transfer Call</em> feature did not work</h6>
<h3 id="calltransferwebhook">Call Transfer Webhook</h3>
<p>It is possible to set up a transfer using webhooks and one of SignalWire’s REST APIs, but this adds complexity and requires a helper server. I did not see a quick or straightforward way of doing this, so I did not try it.</p>
<h2 id="recording">Recording</h2>
<p>Unfortunately, Dialogflow agents cannot be addressed using SignalWire’s LAML scripting language. This means it is not possible to implement a simple record-and-forward script to record the incoming call. You could get around the lack of Dialogflow agent addressing by forwarding the call to your Dialogflow number, but that would bring you back to <a href="https://cogint.ai/3-methods-for-connecting-a-phone-call-to-dialogflow/#method2forwardcallstodialogflowphonegateway">Method 2</a> and its issues, which is what we were trying to avoid with the Dialogflow Connector in the first place.</p>
<h2 id="hackforsmssupport">Hack for SMS support</h2>
<p>Twilio is the only vendor to have a native Text Messaging Integration with Dialogflow. This is not documented anywhere, but it turns out you can use that same integration with SignalWire too. Simply just go to Dialogflow’s integrations, select Twilio (Text Messaging) and then  insert your SignalWire project key and API token in the Dialog Account SID and Account Token field respectively.  Make a note of the generated <em>Request URL</em>:<br>
<img src="https://cogint.ai/content/images/2019/06/dialogflow-twilio-sms-connector.png/resize?w=1200" alt="SignalWire’s Dialogflow Connector: 2019 review"><br>
Now to into your phone number settings inside SignalWire, pop down to Messaging Settings, and place the <em>Request URL</em> into the “when a message comes in” field:<br>
<img src="https://cogint.ai/content/images/2019/06/signalwire-message-settings.png/resize?w=1200" alt="SignalWire’s Dialogflow Connector: 2019 review"><br>
I do not believe this is officially supported, so this is really a hack - but it works!</p>
<h2 id="signalwiresscore">SignalWire’s score</h2>
<p>SignalWire was not much of a fit against the criteria we identified in the <a href="https://cogint.ai/3-methods-for-connecting-a-phone-call-to-dialogflow/">first post of this series</a>:</p>
<table>
<thead>
<tr>
<th><strong>Requirement</strong></th>
<th><strong>SignalWire Dialogflow Connector</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>Call Transfer</td>
<td>Maybe, but it is not straightforward</td>
</tr>
<tr>
<td>Recording</td>
<td>No</td>
</tr>
<tr>
<td>Playback Interruption</td>
<td>No</td>
</tr>
<tr>
<td>No activity detection</td>
<td>No</td>
</tr>
<tr>
<td>DTMF detection</td>
<td>No</td>
</tr>
<tr>
<td>SMS support</td>
<td>Yes - use the Twilio Text Messaging integration</td>
</tr>
</tbody>
</table>
<h2 id="whatsnext">What’s Next</h2>
<p>Next week I will provide an update to VoxImplant’s Dialogflow Connector that they <a href="https://cogint.ai/making-ivrs-not-suck-with-dialogflow-alexey-aylarov/">introduced here</a> a year ago.  There are a lot of changes there. After that we will review the major challenges we encountered in implementing this system and share some implementation examples from our research. Make sure to <a href="https://cogint.ai/subscribe/">subscribe</a> so you don't miss anything as we progress through this series.</p>
<hr>
<h3 id="abouttheauthor">About the Author</h3>
<p>Chad Hart is an analyst and consultant with <a href="https://cwh.consulting">cwh.consulting</a>, a product management, marketing, and strategy advisory helping to advance the communications industry. In addition, recently he co-authored a study on <a href="https://krankygeek.com/research">AI in RTC</a> and helped to organize an <a href="https://krankygeek.com">event</a> / <a href="https://www.youtube.com/watch?v=P38cd3GLn74&amp;list=PL4_h-ulX5eNfaM0QM5r-PewWaY_zgLH7b">YouTube series</a> covering that topic.</p>
<hr>
<h3 id="remembertosubscribefornewpostnotificationsandfollowcogintai">Remember to <a href="https://cogint.ai/subscribe/">subscribe</a> for new post notifications and follow <a href="https://twitter.com/cogintai">@cogintai</a>.</h3>
</div>]]></content:encoded></item><item><title><![CDATA[3 Methods for Connecting a Phone Call to Dialogflow]]></title><description><![CDATA[A look at different ways of creating a telephone voicebot with Dialogflow including a review of Dialogflow's Phone Gateway and methods for communications platform (CPaaS) interaction]]></description><link>https://cogint.ai/3-methods-for-connecting-a-phone-call-to-dialogflow/</link><guid isPermaLink="false">5cfef64f04a8c82a4f530e15</guid><category><![CDATA[voicebot]]></category><category><![CDATA[dialogflow]]></category><category><![CDATA[review]]></category><dc:creator><![CDATA[Chad Hart]]></dc:creator><pubDate>Tue, 11 Jun 2019 00:31:00 GMT</pubDate><media:content url="https://cogint.ai/content/images/2019/06/3-dialogflow-gateway-methods-picture-v2-1.png" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://cogint.ai/content/images/2019/06/3-dialogflow-gateway-methods-picture-v2-1.png" alt="3 Methods for Connecting a Phone Call to Dialogflow"><p>We are starting up a new series on using Dialogflow as an Interactive Voice Response (IVR) replacement. We have covered this topic a few times here, including looking at Dialogflow’s own <a href="https://cogint.ai/dialogflow-phone-bot/">Phone Gateway</a> and the Gateway interface implementations of <a href="https://cogint.ai/making-ivrs-not-suck-with-dialogflow-alexey-aylarov/">VoxImplant</a> and <a href="https://cogint.ai/calling-into-dialogflow-with-signalwire/">SignalWire</a>. Beyond just building simple demo systems, Chad had been exploring improved ways of using Dialogflow to implement an IVR replacement for telephony environments.</p>
<p>Joining me to help with this series is Emiliano Pelliccioni. Emiliano is a developer at webRTC.ventures. He has worked on similar projects with me in the past and has continued researching this area. We decided to get together to share our research and experiments in this domain.</p>
<p>In this first post, we want to share some of the methods we explored for connecting Dialogflow to a phone call. Let’s first review what’s involved in making this connection and some nice-to-have features before reviewing the methods.</p>
<h1 id="whatisadialogflowtelephonygateway">What is a Dialogflow Telephony Gateway?</h1>
<p>We want to be able to dial a phone number and have Dialogflow handle the interaction as a voicebot IVR. To do this, there needs to be some kind of gateway that handles both signaling and media conversion. On the signaling side, the gateway needs to take the telephony signaling - which is almost always based on SIP - and use that to invoke the proper Dialogflow commands to launch and interact with the bot. This also includes handling hang-ups and the termination of the call.</p>
<p>Slightly more complicated is the media conversion that needs to take place. Dialogflow’s interface for real-time speech input is gRPC. The gateway needs to convert the SRTP or RTP media used by the telephony end to a gRPC bitstream using Dialogflow-friendly codecs. The gateway also needs to play back the response speech generated by Dialogflow (or use its own Text-to-Speech mechanism to vocalize Dialogflow’s response text).</p>
<p><img src="https://cogint.ai/content/images/2019/06/dialogflow-telephony-gateway.png" alt="3 Methods for Connecting a Phone Call to Dialogflow"></p>
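<p>To make the codec-conversion step concrete: the PSTN leg of a call typically carries G.711 μ-law audio at 8 kHz, while Dialogflow’s streaming API expects linear PCM. Below is a minimal sketch of that decode step, assuming μ-law input; a real gateway would use an optimized native decoder rather than per-sample Python.</p>

```python
def ulaw_to_pcm16(sample: int) -> int:
    """Decode one 8-bit G.711 mu-law byte to a signed 16-bit linear PCM value."""
    sample = ~sample & 0xFF                  # mu-law bytes are stored inverted
    sign = sample & 0x80
    exponent = (sample >> 4) & 0x07          # 3-bit segment number
    mantissa = sample & 0x0F                 # 4-bit step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude

def decode_ulaw_frame(payload: bytes) -> list:
    """Decode a whole RTP payload (e.g. a 20 ms, 160-byte frame) to PCM samples."""
    return [ulaw_to_pcm16(b) for b in payload]
```

<p>The resulting PCM frames are what the gateway would then stream to Dialogflow over gRPC.</p>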
<h1 id="whatelseshouldthedialogflowgatewayhandle">What else should the Dialogflow Gateway handle?</h1>
<p>Beyond basic connectivity, there are a few other features that will help to improve development and user interaction.</p>
<p>A short list of the top features we evaluated is:</p>
<ul>
<li><strong>Recording</strong> - not a hard requirement, but having a full recording of both parties is invaluable for debugging and improving the system</li>
<li><strong>Call transfer</strong> - in most cases you need to give the user an option to talk to a human, or the natural result of a bot will be to transfer the call</li>
<li><strong>Playback interruption</strong> - ideally your voicebot could handle an asynchronous conversation, so the user could interrupt whatever the bot is saying and that speech would be processed</li>
<li><strong>No activity detection</strong> - if you were in the middle of a conversation and it suddenly went silent, you would say “are you there”? The bot needs the mechanism to do something similar. If you were using Dialogflow to make a Google Assistant bot, they give you an <code>actions_intent_NO_INPUT</code> event and mechanisms to <a href="https://developers.google.com/actions/assistant/reprompts">setup reprompt intents</a> - the gateway needs to provide something similar</li>
<li><strong>DTMF detection</strong> - even if the goal is to eliminate DTMF menus, sometimes it is nice to have DTMF as a backup option or alternative input method - especially if you are trying to do something like capture a phone number and Dialogflow cannot understand the caller</li>
<li><strong>SMS</strong> - nice to have; more on this below</li>
</ul>
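<p>Of these, no-activity detection is the one the gateway most clearly has to own, since Dialogflow only reacts to input it receives. A small sketch of what that gateway-side logic could look like - the 8-second default and the idea of firing a reprompt event are illustrative assumptions, not a documented Dialogflow interface:</p>

```python
import time

class NoInputTimer:
    """Track when the caller last spoke and decide when to fire a reprompt,
    similar in spirit to Google Assistant's actions_intent_NO_INPUT event."""

    def __init__(self, timeout_s: float = 8.0):
        self.timeout_s = timeout_s
        self.last_activity = time.monotonic()

    def heard_speech(self) -> None:
        # Call this whenever the CPaaS reports caller audio / VAD activity.
        self.last_activity = time.monotonic()

    def should_reprompt(self, now: float = None) -> bool:
        # When True, the gateway would trigger a "are you there?" intent.
        now = time.monotonic() if now is None else now
        return (now - self.last_activity) >= self.timeout_s
```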
<p>The scope of our project was fairly limited, but other options could include customized Speech-to-Text (STT) and Text-to-Speech (TTS) engines instead of using the ones built into Dialogflow. This could allow for better coverage of custom vocabularies or unique voice synthesis.</p>
<h3 id="smssupport">SMS support</h3>
<p>There are many cases when it is easier to send the user a link. A restaurant wouldn’t want to read off an entire menu - it is easier to just send a link to it. In practice, most callers use their mobile phones to call, which means they should be able to receive text messages. The gateway should help determine if the caller is on an SMS-capable device and send them text messages if that would help in the interaction.</p>
<p>If you are going to send SMS, you should also be prepared to receive SMS and interact via text without requiring voice. In fact, this could allow ongoing dialog after the phone call has finished. To implement this, the gateway needs to keep some state on the user and manage interactions between the voice telephony and SMS environment.</p>
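<p>The simplest version of that state is mapping each caller to a single Dialogflow session so voice and SMS turns share one conversational context. A sketch of that idea - the design is our assumption, not an official Dialogflow or CPaaS API:</p>

```python
import uuid

class SessionStore:
    """Map a caller's phone number to one Dialogflow session ID so that
    voice and SMS interactions continue the same conversation."""

    def __init__(self):
        self._sessions = {}

    def session_for(self, phone_number: str) -> str:
        # Reuse the existing session if this number has talked to us before,
        # whether by voice or by text; otherwise mint a new one.
        if phone_number not in self._sessions:
            self._sessions[phone_number] = uuid.uuid4().hex
        return self._sessions[phone_number]
```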
<h1 id="threemethodsforimplementingadialogflowtelephonygateway">Three Methods for Implementing a Dialogflow Telephony Gateway</h1>
<p>We found 3 major methods for connecting a phone call to Dialogflow:</p>
<ol>
<li>Use Dialogflow’s built-in Phone Gateway</li>
<li>Forward a call from a telephony system into Dialogflow’s Phone Gateway</li>
<li>Directly connect a telephony system to Dialogflow</li>
</ol>
<h2 id="method1usethedialogflowphonegateway">Method 1: Use the Dialogflow Phone Gateway</h2>
<p><img src="https://cogint.ai/content/images/2019/06/dialogflow-phone-gateway.png" alt="3 Methods for Connecting a Phone Call to Dialogflow"><br>
Chad <a href="https://cogint.ai/dialogflow-phone-bot/">previously reviewed this</a> in August 2018. Nothing has changed here in terms of the setup or functionality. The service is still in what Google defines as Beta, but that is usually a pretty high standard. See that previous post for a walkthrough or just go to Dialogflow’s own <a href="https://cloud.google.com/dialogflow-enterprise/docs/telephony">docs on using the Phone Gateway</a> (which they call “Telephony Gateway” in that link).</p>
<p>English is still the <a href="https://cloud.google.com/dialogflow-enterprise/docs/reference/language">only supported language with telephony</a> if you want to generate speech output from Dialogflow. Other than that, the main benefits of the Phone Gateway are:</p>
<ul>
<li>It is very easy to set up</li>
<li>It is free, subject to quotas unless you want the <a href="https://cloud.google.com/dialogflow-enterprise/pricing">Enterprise Edition</a></li>
<li>A <code>TELEPHONY_WELCOME</code> event is included for special handling of phone calls</li>
<li>The Dialogflow Intent GUI includes a tab for special handling of telephony responses, including audio playback, special speech synthesis, and call transfer (see below)</li>
<li>You can easily transfer calls to a single US number</li>
</ul>
<p><img src="https://cogint.ai/content/images/2019/06/dialogflow-telephony-tab.png" alt="3 Methods for Connecting a Phone Call to Dialogflow"></p>
<h3 id="dialogflowphonegatewaysummaryscorecard">Dialogflow Phone Gateway Summary Scorecard</h3>
<p>So, Dialogflow has some convenient features in its Phone Gateway but in the end, it is pretty limited:</p>
<table>
<thead>
<tr>
<th><strong>Requirement</strong></th>
<th><strong>Dialogflow Phone Gateway</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>Recording</td>
<td>No</td>
</tr>
<tr>
<td>Call Transfer</td>
<td>Yes, but only to US numbers</td>
</tr>
<tr>
<td>Playback Interruption</td>
<td>No</td>
</tr>
<tr>
<td>No activity detection</td>
<td>No</td>
</tr>
<tr>
<td>DTMF detection</td>
<td>No</td>
</tr>
<tr>
<td>SMS support</td>
<td>No</td>
</tr>
</tbody>
</table>
<h2 id="method2forwardcallstodialogflowphonegateway">Method 2: Forward Calls to Dialogflow Phone Gateway</h2>
<p><img src="https://cogint.ai/content/images/2019/06/call-forwarding.png" alt="3 Methods for Connecting a Phone Call to Dialogflow"><br>
Dialogflow does not have a whole lot of telephony controls, but it is possible to use another telephony platform to forward calls to the Dialogflow Phone Gateway. This platform could be anything - a commercial PBX or ACD system, an open source telephony platform like Asterisk or Freeswitch, or a Communications Platform as a Service (CPaaS) that provides this functionality as a cloud-based service, like Twilio, Nexmo, SignalWire, VoxImplant, and others. We largely evaluated CPaaS, so we are only going to refer to that option from here on out. This approach enables some features if you are willing to leverage Dialogflow’s webhooks to interact with the telephony platform.</p>
<h3 id="forwardingbenefits">Forwarding Benefits</h3>
<p>Some of the benefits of this approach vs. just using the Dialogflow Phone Gateway are:</p>
<ul>
<li>Support for any phone number you want - the Dialogflow Phone Gateway only supports US numbers now</li>
<li>Support for multiple phone numbers - you just get one with the Dialogflow Phone Gateway</li>
<li>Easy Recording - this is almost always available as a feature</li>
<li>More control over call transfer if your platform can handle incoming webhooks and programmatic control</li>
<li>Handle SIP calls if your platform supports it</li>
<li>Advanced call flows - such as conferencing in an agent to listen or help with the voicebot interaction</li>
</ul>
<p>Features like recording are generally simple to implement. Call transfer control is usually better if it is possible to tell your CPaaS to hang up and send a call somewhere based on a webhook fulfillment.</p>
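<p>As a sketch of the webhook-driven transfer idea: the fulfillment handler returns a response whose custom <code>payload</code> your CPaaS app reads to take over the call. The <code>payload</code> keys here are purely illustrative - each CPaaS defines its own schema:</p>

```python
def transfer_fulfillment(transfer_to: str) -> dict:
    """Build a Dialogflow webhook fulfillment response that asks the
    CPaaS side of the app to transfer the call.  The "transfer" payload
    key is a hypothetical convention read by your own CPaaS code."""
    return {
        # Spoken/displayed to the caller before the transfer happens.
        "fulfillmentText": "One moment while I transfer you to an agent.",
        # Custom payload your CPaaS app inspects to trigger the transfer.
        "payload": {"transfer": {"to": transfer_to}},
    }
```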
<h4 id="nexmoskrankygeekaiinrtcshowexample">Nexmo’s Kranky Geek AI in RTC Show Example</h4>
<p>Nexmo provided a good example of this at the last Kranky Geek AI in RTC event:</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/bxZzEpS6Uxs?start=333" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<p>You can see that code here:<br>
<a href="https://github.com/alwell-kevin/simple-smart-ivr-framework">https://github.com/alwell-kevin/simple-smart-ivr-framework</a></p>
<h3 id="forwardingdownsides">Forwarding Downsides</h3>
<p>The main drawback of this approach is that there is still no direct way for your CPaaS app to communicate with Dialogflow mid-call - Dialogflow can only signal back to your CPaaS app via webhooks. This means your CPaaS app could listen for things like DTMF presses or the user interrupting an audio playback, but it has no way to push those events into the active Dialogflow session.</p>
<h4 id="smsproxydevelopmentmayberequired">SMS proxy development may be required</h4>
<p>Dialogflow’s Phone Gateway also doesn’t support SMS, but nearly every CPaaS platform does. Unfortunately, only Twilio has a Text Messaging option built into Dialogflow’s integrations. While most CPaaS platforms provide easy mechanisms for managing text message interaction programmatically, a developer would need to invoke Dialogflow’s APIs to manually manage these interactions.</p>
<p><img src="https://cogint.ai/content/images/2019/06/SMS-proxy.png" alt="3 Methods for Connecting a Phone Call to Dialogflow"><br>
As shown in the figure above, this app would need to receive the SMS text content from the CPaaS and forward it to one of Dialogflow’s APIs. Once the intent response has been received, your app would use the CPaaS API to send an SMS to the user containing the answer.</p>
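<p>One turn of that proxy loop can be sketched in a few lines. To keep the shape clear without tying it to a specific SDK, <code>detect_intent</code> and <code>send_sms</code> below are injected callables standing in for the Dialogflow detect-intent call and the CPaaS SMS API - both names are placeholders, not real library functions:</p>

```python
def sms_proxy_turn(from_number, text, detect_intent, send_sms):
    """Handle one inbound SMS: forward the text to Dialogflow, then send
    the intent response back to the caller over SMS."""
    # Using the phone number as the session ID keeps the dialog state
    # continuous across this user's messages.
    reply = detect_intent(session_id=from_number, text=text)
    send_sms(to=from_number, body=reply)
    return reply
```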
<h4 id="flexibilitymoredevelopmenttime">Flexibility = more development time</h4>
<p>The same may be true for some of the other requirements. If you are willing to manage a lot more code, you could have your app translate between the CPaaS and Dialogflow for some of these features - but that is essentially building a good part of the gateway.</p>
<h4 id="costqualityimplications">Cost &amp; quality implications</h4>
<p>This forwarding also comes at a cost - both in telephony charges and in call quality. You are paying for the inbound call leg to the CPaaS, the outbound leg from the CPaaS to Dialogflow’s Phone Gateway, and then eventually again when you upgrade to Dialogflow’s Enterprise plan (which you will likely do in a production environment). Also, forwarding a call over the PSTN is less than ideal for voice quality - the extra gateway leg will definitely introduce some latency and hopefully does not cause other impairments.</p>
<p>A direct connection from the CPaaS into Dialogflow is preferable, and that is what we will discuss in <a href="#method3directconnectivitytodialogflow">Method 3</a>.</p>
<h3 id="forwardingsummaryscorecard">Forwarding Summary Scorecard</h3>
<p>Forwarding is better, but mileage will vary depending on the capabilities of your platform.</p>
<table>
<thead>
<tr>
<th><strong>Requirement</strong></th>
<th><strong>CPaaS + Dialogflow Phone Gateway</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>Call Transfer</td>
<td>Yes - some work to code, but full control over transferring</td>
</tr>
<tr>
<td>Recording</td>
<td>Yes - generally easy with a given CPaaS</td>
</tr>
<tr>
<td>Playback Interruption</td>
<td>Not unless you run voice activity detection and are controlling the TTS output</td>
</tr>
<tr>
<td>No activity detection</td>
<td>No</td>
</tr>
<tr>
<td>DTMF detection</td>
<td>No</td>
</tr>
<tr>
<td>SMS support</td>
<td>Maybe, depending on the CPaaS</td>
</tr>
</tbody>
</table>
<h2 id="method3directconnectivitytodialogflow">Method 3: Direct Connectivity to Dialogflow</h2>
<p><img src="https://cogint.ai/content/images/2019/06/Direct-CPaaS-Connection.png" alt="3 Methods for Connecting a Phone Call to Dialogflow"><br>
As is illustrated in the figure above, if your CPaaS can connect directly to Dialogflow you will save a conversion step. Assuming you want to leverage Dialogflow’s built-in Speech-to-Text and Text-to-Speech capabilities, this means your CPaaS needs to have a gateway that can both signal Dialogflow’s APIs and convert RTP media to Dialogflow-friendly codecs over gRPC and back.</p>
<h3 id="advantages">Advantages</h3>
<ul>
<li>All the same benefits as in the forwarding methodology</li>
<li>Lower cost - no Dialogflow telephony charges (if paying for Enterprise)</li>
<li>Better quality - I did not quantitatively measure this, but it should have lower latency with better voice quality; use of HD audio end-to-end is also possible (though only used if connecting end-to-end with VoIP)</li>
</ul>
<p>After that, the advantages really depend on the specific platform.</p>
<h3 id="disadvantages">Disadvantages</h3>
<p>The main disadvantage here is the limited number of options available. Nearly every telephony platform provides call forwarding but very few offer a direct Dialogflow connector. After that, the ease of implementation will really depend on the platform selected and your development skills.</p>
<h3 id="directconnectivitysolutions">Direct Connectivity Solutions</h3>
<p>The two main commercial CPaaS options are <a href="https://cogint.ai/making-ivrs-not-suck-with-dialogflow-alexey-aylarov/">VoxImplant</a> and <a href="https://cogint.ai/calling-into-dialogflow-with-signalwire/">SignalWire</a>.</p>
<p>On the open source side, there is the <a href="https://github.com/davehorton/simple-dialogflow-example">Dialogflow Interface</a> to the <a href="https://www.drachtio.org/">Drachtio SIP Server</a> which also requires Freeswitch. <a href="http://unimrcp.org/solutions/google-dialogflow">UniMRCP</a> has a MRCP interface for Dialogflow.</p>
<p>If you are looking for a licensed, commercial gateway, Audiocodes has what it calls a <a href="https://www.audiocodes.com/solutions-products/solutions/audiocodes-voiceai/voiceai-gateway">Voice.AI Gateway</a> that connects to Dialogflow.  USAN also has a <a href="http://usan.com/google/dfetg/">Dialogflow gateway product</a>. Some quick searches also show <a href="http://www.voxibot.com">Voxibot</a> and  <a href="https://www.tenios.de/en/voicebot-connector">Tenios</a> claim to have some direct Dialogflow Voice gateway capabilities.</p>
<h3 id="directconnectivitysummaryscorecard">Direct Connectivity Summary Scorecard</h3>
<p>This scorecard will vary considerably by platform since implementation details differ. The basic conversion of media to the Dialogflow gRPC interface seems to be common among all. Few have deep capabilities for interacting with Dialogflow.</p>
<table>
<thead>
<tr>
<th><strong>Requirement</strong></th>
<th><strong>CPaaS Direct Dialogflow Connection</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>Recording</td>
<td>Yes - generally easy with a given platform</td>
</tr>
<tr>
<td>Call Transfer</td>
<td>Yes - some work to code, but full control over transferring</td>
</tr>
<tr>
<td>Playback Interruption</td>
<td>Depends on the platform</td>
</tr>
<tr>
<td>No activity detection</td>
<td>Depends on the platform</td>
</tr>
<tr>
<td>DTMF detection</td>
<td>Depends on the platform</td>
</tr>
<tr>
<td>SMS support</td>
<td>Maybe, depending on the platform</td>
</tr>
</tbody>
</table>
<h1 id="whatsnext">What’s Next</h1>
<p>We plan to provide updated evaluations of SignalWire and VoxImplant’s Dialogflow connectors next. After that, we will review the major challenges we encountered in implementing this system and share some implementation examples from our research. Make sure to <a href="https://cogint.ai/subscribe/">subscribe</a> so you don't miss anything as we progress through this series.</p>
<p>You can see the next parts of the series <a href="https://cogint.ai/building-a-voicebot-ivr-with-dialogflow/">here</a>.</p>
<hr>
<h3 id="abouttheauthors">About the Authors</h3>
<p>Chad Hart is an analyst and consultant with <a href="https://cwh.consulting">cwh.consulting</a>, a product management, marketing, and strategy advisory helping to advance the communications industry. In addition, he recently co-authored a study on <a href="https://krankygeek.com/research">AI in RTC</a> and helped to organize an <a href="https://krankygeek.com">event</a> / <a href="https://www.youtube.com/watch?v=P38cd3GLn74&amp;list=PL4_h-ulX5eNfaM0QM5r-PewWaY_zgLH7b">YouTube series</a> on that topic.</p>
<p>Emiliano Pelliccioni is a computer engineer working at <a href="https://webrtc.ventures/">webRTC.ventures</a> and specializes in developing real time communication applications for clients around the globe. One of his projects includes developing bot integrations for a major CPaaS provider.</p>
<hr>
<h3 id="remembertosubscribefornewpostnotificationsandfollowcogintai">Remember to <a href="https://cogint.ai/subscribe/">subscribe</a> for new post notifications and follow <a href="https://twitter.com/cogintai">@cogintai</a>.</h3>
</div>]]></content:encoded></item><item><title><![CDATA[Kranky Geek AI in RTC 2018 Event Review]]></title><description><![CDATA[A review of the Kranky Geek AI in RTC event with relevant highlights from speech analytics, voicebot, computer vision, and RTC optimization talks.]]></description><link>https://cogint.ai/ai-in-rtc-2018-event-review/</link><guid isPermaLink="false">5bf8615104a8c82a4f530ddc</guid><category><![CDATA[speech analytics]]></category><category><![CDATA[computer vision]]></category><category><![CDATA[voicebot]]></category><category><![CDATA[rtc_optimization]]></category><category><![CDATA[review]]></category><dc:creator><![CDATA[Chad Hart]]></dc:creator><pubDate>Tue, 27 Nov 2018 09:45:42 GMT</pubDate><media:content url="https://cogint.ai/content/images/2018/11/ai-in-rtc-logo-1.png" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://cogint.ai/content/images/2018/11/ai-in-rtc-logo-1.png" alt="Kranky Geek AI in RTC 2018 Event Review"><p>Less than two weeks ago I ran the <a href="https://krankygeek.com">Kranky Geek AI in RTC conference</a> with Tsahi Levent-Levi and Chris Koehncke. Historically we covered mostly WebRTC content but last year we decided to add some AI and Machine Learning topics. This year we focused most of the event on AI. Nearly all of this content is highly relevant to the topics covered by this blog, so below is a summary of the most relevant talks with highlights and links.</p>
<p>Much like our <a href="https://krankygeek.com/research">report</a>, we segmented the AI in RTC talks into 4 main areas:</p>
<ul>
<li><strong><a href="https://cogint.ai/ai-in-rtc-2018-event-review/#speechanalytics">Speech analytics</a></strong> – speech to text with Machine Learning analysis of the waveform and transcript</li>
<li><strong><a href="https://cogint.ai/ai-in-rtc-2018-event-review/#voicebots">Voicebots</a></strong> - automated programs that interact with users in a conversational dialog using speech as input and output like Siri, Alexa, Cortana, etc.</li>
<li><strong><a href="https://cogint.ai/ai-in-rtc-2018-event-review/#computervision">Computer Vision</a></strong> – processing video to analyze and understand what is seen</li>
<li><strong><a href="https://cogint.ai/ai-in-rtc-2018-event-review/#rtcoptimization">RTC Optimization</a></strong> - machine learning methods used to improve service quality or cost performance<br>
<img src="https://cogint.ai/content/images/2018/11/ai-in-rtc-with-logos.png" alt="Kranky Geek AI in RTC 2018 Event Review"></li>
</ul>
<p>I have details on the talks in each of these areas below.</p>
<h1 id="speechanalytics">Speech Analytics</h1>
<p>This is a big area and we included three talks – two from services that make heavy use of speech analytics – Voicera and Dialpad - and one that provides a transcription and speech analytics API – Voicebase.</p>
<h2 id="paralinguistics">Paralinguistics</h2>
<blockquote>
<p><em><strong>paralinguistics</strong></em><br>
The branch of linguistics which studies non-phonemic aspects of speech, such as tone of voice, tempo, etc.; non-phonemic characteristics of communication; paralanguage.<br>
<a href="https://en.oxforddictionaries.com/definition/paralinguistics">Oxford Dictionary</a></p>
</blockquote>
<p>Perhaps a better definition of paralinguistics is the one provided by Voicebase’s CTO, Jeff Shukis:</p>
<blockquote>
<p>How something is said, distinct from what is said</p>
</blockquote>
<p>In this talk, Jeff walked through Voicebase’s investigation of paralinguistic data and where they ended up. In summary, they capture the two most dominant frequencies on a per-word level along with a relative energy metric. They also look at the relative volume level and total time spoken for each word to inform on speech rate. For end-customer applications, they roll this into some aggregate metrics across the conversation to look for meaningful changes in tone and volume. The relevance of any one of these features is really determined when all the data is fed into a machine learning model that maps the call data against customer-provided agent and call outcome data.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/WURDqnOXG1I" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
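<p>To make the “dominant frequencies per word” feature concrete, here is a toy version using a plain FFT in numpy. This is our own illustration of the idea, not Voicebase’s pipeline, which is far more sophisticated:</p>

```python
import numpy as np

def dominant_frequencies(samples, sample_rate, k=2):
    """Return the k strongest frequency components (Hz, magnitude) of an
    audio window - a rough analogue of a per-word paralinguistic feature."""
    window = samples * np.hanning(len(samples))   # taper to reduce leakage
    spectrum = np.abs(np.fft.rfft(window))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    top = np.argsort(spectrum)[-k:][::-1]          # indices of k largest bins
    return [(float(freqs[i]), float(spectrum[i])) for i in top]
```

<p>Feeding this a window of telephone audio (8 kHz) for each recognized word would yield the kind of tone features Jeff described, which can then be aggregated across the call.</p>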
<h2 id="dealingwithcustomjargon">Dealing with custom jargon</h2>
<p>Every industry and business has its own custom vocabulary. Words like “WebRTC” don’t show up in a standard dictionary. Most personal and company names are also hard for speech engines. Unfortunately, these terms often end up being some of the most meaningful words for understanding what was said. Dialpad talked about how they address this with a technique they call Domain Adaptation. Etienne Manderscheid, their VP of Machine Learning, gave a step-by-step example of how to implement domain adaptation with Kaldi. Etienne agreed to do a cogint.ai post with a complete code walkthrough, so I’ll save the details for that.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/ekntgaWG-30" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<h1 id="voicebots">Voicebots</h1>
<p>As I covered <a href="https://cogint.ai/vocebot-ai-in-the-call-center/">here before</a>, IVRs and contact centers are huge new opportunity areas for voicebots. Nexmo and IBM both actually covered this topic.</p>
<h2 id="contactcentervoicebotarchitecturesandchallenges">Contact Center Voicebot Architectures and Challenges</h2>
<p>Brian Pulito of IBM gave more of an architectural talk, setting up some of the challenges contact centers face with traditional technologies and where voicebots and other AI tools fit. Brian also walked through some of the challenges in implementing these systems, including:</p>
<ul>
<li>Dealing with noise</li>
<li>Handling dialects and custom vocabularies</li>
<li>Voice authentication</li>
<li>Handling latency</li>
<li>Using SSML for more natural speech synthesis</li>
<li>Slot filling</li>
<li>Intent training time</li>
</ul>
<iframe width="560" height="315" src="https://www.youtube.com/embed/tUyQhtwuINQ" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<h2 id="integratingwithdialogflow">Integrating with Dialogflow</h2>
<p>We have covered telephony integration with Dialogflow on several different platforms, including <a href="https://cogint.ai/dialogflow-phone-bot/">Dialogflow’s own Phone Gateway</a>, <a href="https://cogint.ai/making-ivrs-not-suck-with-dialogflow-alexey-aylarov/">VoxImplant</a>, and <a href="https://cogint.ai/calling-into-dialogflow-with-signalwire/">SignalWire</a>. Nexmo gave a walkthrough of their approach for this. Unlike VoxImplant and SignalWire, which have a native gRPC integration that sends audio over IP to Dialogflow, Nexmo walked through how to simply forward an incoming call to Dialogflow’s phone gateway. While less than ideal from a cost perspective (you make an extra outbound call) and a quality perspective (you add another leg through the PSTN), this is actually very easy to do. In addition, Dialogflow’s ability to add a webhook for fulfillment means you can do a better job of handling call transfers than the native Phone Gateway allows.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/bxZzEpS6Uxs" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<h1 id="computervision">Computer Vision</h1>
<p>We had a several talks covering Computer Vision (CV). A couple of talks covered many topics, but included some interesting CV-related tidbits:</p>
<ul>
<li><a href="https://youtu.be/zt1oiG3Yen0">Intel talked</a> about some of their tools for helping with computer vision in the cloud</li>
<li><a href="https://youtu.be/8sRluA_hrtc">Microsoft showed</a> some cool videos showing Mixed Reality broadcasts with the HoloLens</li>
</ul>
<p>Then we had a few that were entirely focused on CV.</p>
<h2 id="persondetectioninsidefacebooksportal">Person detection inside Facebook’s Portal</h2>
<p>Facebook recently launched a dedicated video chat device that leverages Facebook Messenger. Portal is meant to be placed in a stationary location inside the home. To make sure it captures the appropriate action, no matter where the call participants are located in the room, it utilizes computer vision to identify people and frame them properly.</p>
<p>Facebook is frequently in the news about privacy issues, so a Facebook camera that tracks people has been met with a lot of skepticism. We were lucky that Facebook let Eric Hwang and Arthur Alem talk about the Portal implementation at all, even if they weren’t allowed to go into deep detail on most of it. They did show examples of the CV features and discussed the challenges of running their vision algorithms on a constrained device in a latency-sensitive application. They also talked a bit about their WebRTC implementation with simulcast and how that applies to user-based video selection and calling to non-Portal Messenger clients.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/LmZV1l65NB4" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<p>We also had two deeper Computer Vision talks, from Agora.io and Houseparty. Both focused on using CV algorithms to improve media quality, so I will cover them in the next section.</p>
<h1 id="rtcoptimization">RTC Optimization</h1>
<p>In our <a href="https://krankygeek.com/research">report</a>, we were surprised by the relative lack of ML activity for improving media quality and delivery. Talks at AI in RTC made me feel a lot better about this. Houseparty and Agora.io both covered improving video quality, and we also had a couple of talks from Callstats.io and RingCentral on using Machine Learning to make better sense of the metadata that accompanies calls.</p>
<h2 id="superresolutiononrealtimecommunicationsvideo">Super-Resolution on Real Time Communications Video</h2>
<p>In the day’s most technical talk, Shawn Zhong of Agora went into details on their super-resolution machine learning model. <a href="https://en.wikipedia.org/wiki/Super-resolution_imaging">Super-resolution imaging</a> is a set of techniques for improving how an image looks by enhancing its resolution. Your camera typically captures at a high resolution, but in RTC this resolution is reduced by the video encoder to match available bandwidth and processing constraints. The image quality is often then further reduced by packet loss during transmission. Super-resolution aims to restore the video quality back to the original before the degradation.</p>
<p>Shawn talked about how they use a <a href="https://en.wikipedia.org/wiki/Generative_adversarial_network">Generative Adversarial Network (GAN)</a> to address this problem. He spoke in detail about how GANs work and the challenges of optimizing heavy Deep Neural Networks (DNNs) for use on a typical mobile device.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/Y7p0ye0mrBE" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
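<p>To get a feel for why this is a hard problem, here is a toy sketch of my own (not Agora's model): once the encoder throws detail away, naive upsampling cannot bring it back - recovering that missing detail plausibly is exactly what a super-resolution model is trained to do.</p>

```python
# Toy illustration (not Agora's model): downscaling discards detail that
# naive upsampling cannot restore - the gap a super-resolution model learns to fill.

def downsample(pixels):
    """Halve resolution by averaging adjacent pixel pairs (a crude encoder stand-in)."""
    return [(pixels[i] + pixels[i + 1]) / 2 for i in range(0, len(pixels), 2)]

def upsample_nearest(pixels):
    """Double resolution by repeating each pixel - the naive baseline SR improves on."""
    return [p for p in pixels for _ in range(2)]

original = [10, 200, 10, 200, 10, 200, 10, 200]  # high-frequency detail (an edge pattern)
restored = upsample_nearest(downsample(original))

# The alternating detail is gone: every restored pixel is the flat 105 average.
print(original)  # [10, 200, 10, 200, 10, 200, 10, 200]
print(restored)  # [105.0, 105.0, 105.0, 105.0, 105.0, 105.0, 105.0, 105.0]
```

A learned model instead uses patterns from training data to make an educated guess at the lost detail, rather than just interpolating.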
<h2 id="opensourcemltoimprovevideoquality">Open Source ML to Improve Video Quality</h2>
<p>Gustavo Garcia Barnado previously analyzed the use of Google’s ML Kit to provide <a href="https://webrtchacks.com/ml-kit-smile-detection/">smile detection on iOS for webrtcHacks</a>. For his Kranky Geek AI in RTC talk, Gustavo shared some experiments he did using machine learning to improve video quality. This is super-resolution again. Unlike Shawn, who approached the super-resolution problem from a low-level Machine Learning perspective, Gustavo took an Artifacts Reduction Convolutional Neural Network (AR-CNN) model he found for TensorFlow to see what would happen. When it worked, he migrated it to CoreML for Apple devices.</p>
<p>If you are a Machine Learning expert like Shawn, you can build an optimized end-to-end model. But if you aren’t an ML PhD, Gustavo shows how you can still get great results leveraging open source libraries.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/rEsC8pLaVSE" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<h2 id="analyzingcallstatisticswithunsupervisedlearning">Analyzing Call Statistics with Unsupervised Learning</h2>
<p>For WebRTC deployments, the <a href="https://developer.mozilla.org/en-US/docs/Web/API/RTCStats">RTCStats API</a> is a great way to collect a lot of data. However, in some ways it is too good - it is easy to get lost in all the data collected. Varun Singh of Callstats showed how they used <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">Principal Component Analysis (PCA)</a> and <a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-distributed Stochastic Neighbor Embedding (t-SNE)</a> to cluster users with similar issues. Varun walks through a real-world example where they were able to identify specific ISPs that were causing issues for users across different WebRTC apps.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/4VeUiPbXf7c" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
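<p>To illustrate the PCA step, here is a hand-rolled two-dimensional sketch with made-up numbers and stat names (Callstats' production pipeline works on many more RTCStats dimensions and layers t-SNE on top for visualization):</p>

```python
import math

# Hand-rolled 2-D PCA sketch. The samples are invented: each row is
# (round-trip time in ms, packet loss %) pulled from RTCStats-like data.
samples = [
    (40, 0.1), (45, 0.2), (42, 0.1),    # healthy calls cluster
    (180, 4.0), (200, 5.5), (190, 4.8)  # one ISP's troubled calls cluster
]

n = len(samples)
mean = [sum(col) / n for col in zip(*samples)]
centered = [(x - mean[0], y - mean[1]) for x, y in samples]

# Entries of the 2x2 covariance matrix [[a, b], [b, c]]
a = sum(x * x for x, _ in centered) / n
b = sum(x * y for x, y in centered) / n
c = sum(y * y for _, y in centered) / n

# First principal component = eigenvector of the largest eigenvalue (closed form in 2-D)
lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
norm = math.hypot(b, lam - a)
pc1 = (b / norm, (lam - a) / norm)

# Projecting onto PC1 separates the two groups of calls along a single axis
scores = [x * pc1[0] + y * pc1[1] for x, y in centered]
```

The projected scores put the healthy and troubled calls on opposite sides of zero, which is the kind of separation that makes the clusters visible.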
<h2 id="usingmltonormalizecallqualitydata">Using ML to normalize call quality data</h2>
<p>In VoIP systems, call quality is generally measured by end-user and intermediary devices using a metric known as <a href="https://en.wikipedia.org/wiki/Mean_opinion_score">Mean Opinion Score (MOS)</a>. RingCentral noticed all of these devices measured MOS slightly differently. This created a lot of operational headaches when trying to identify and troubleshoot call quality issues. Curtis Lee Peterson of RingCentral’s operations team talked through how they took more than a million data records and ran them through a TensorFlow model to produce normalized, accurate MOS data.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/QlJSBPpTx34" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
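<p>As a minimal stand-in for the idea (RingCentral's actual system is a trained TensorFlow model, not a linear fit, and these numbers are hypothetical), you could imagine fitting a per-device correction that maps a device's reported MOS onto a reference scale:</p>

```python
# Minimal stand-in for the normalization idea: fit a per-device linear
# correction mapping a device's reported MOS to a reference scale via
# ordinary least squares. Data below is invented for illustration.

def fit_linear(xs, ys):
    """Least-squares slope and intercept for y = m*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    m = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return m, my - m * mx

# Hypothetical paired records: this device reads about 0.4 MOS points low
device_mos    = [2.6, 3.1, 3.6, 4.1]
reference_mos = [3.0, 3.5, 4.0, 4.5]

m, b = fit_linear(device_mos, reference_mos)

def normalize(mos):
    """Map this device's reported MOS onto the reference scale."""
    return m * mos + b
```

With enough paired records per device type, even a simple correction like this removes much of the systematic measurement skew; a learned model can additionally capture non-linear differences.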
<h1 id="more">More?</h1>
<p>See the complete Kranky Geek AI in RTC playlist <a href="https://www.youtube.com/playlist?list=PL4_h-ulX5eNfaM0QM5r-PewWaY_zgLH7b">here</a>. The list includes a few additional WebRTC-oriented talks not included above, including Google's annual talk and one from Discord describing how they adapted the WebRTC stack to handle 2.8 million concurrent callers in their gamer chat app.<br>
Make sure to <a href="https://www.youtube.com/channel/UC9qvM7eiCvDRO5Sm28byZiw?sub_confirmation=1">subscribe</a> for future videos.</p>
<p>If you are looking for market research, we are running a special discount on the <a href="https://krankygeek.com/research">AI in RTC report</a>. <a href="https://chadwallacehart.com/contact/">Contact me</a> about that for more info.</p>
<p>Our next Kranky Geek event in San Francisco is scheduled for November 15, 2019, so mark your calendars now!</p>
<hr>
<h3 id="abouttheauthor">About the Author</h3>
<p>Chad Hart is an analyst and consultant with <a href="https://cwh.consulting">cwh.consulting</a>, a product management, marketing, and strategy advisory helping to advance the communications industry. He recently co-authored a study on AI in RTC - check it out at <a href="https://krankygeek.com/research">krankygeek.com/research</a>.</p>
<hr>
<h3 id="remembertosubscribefornewpostnotificationsandfollowcogintai">Remember to <a href="https://cogint.ai/subscribe/">subscribe</a> for new post notifications and follow <a href="https://twitter.com/cogintai">@cogintai</a>.</h3>
</div>]]></content:encoded></item><item><title><![CDATA[SignalWire's Dialogflow Connector]]></title><description><![CDATA[Review and walkthrough of SignalWire's Dialogflow connector that lets you pick phone numbers and dial into the voicebot. SignalWire is a new CPaaS offer from the Freeswitch team]]></description><link>https://cogint.ai/calling-into-dialogflow-with-signalwire/</link><guid isPermaLink="false">5b997fdb04a8c82a4f530d47</guid><category><![CDATA[voicebot]]></category><category><![CDATA[guide]]></category><category><![CDATA[review]]></category><dc:creator><![CDATA[Chad Hart]]></dc:creator><pubDate>Thu, 13 Sep 2018 10:59:48 GMT</pubDate><media:content url="https://cogint.ai/content/images/2018/09/switchboard-operator.jpg" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://cogint.ai/content/images/2018/09/switchboard-operator.jpg" alt="SignalWire's Dialogflow Connector"><p>In my <a href="https://cogint.ai/dialogflow-phone-bot/">last post</a> I reviewed Dialogflow's new Phone Gateway feature. Dialogflow makes it dirt simple to add a phone number to a bot, but the Phone Gateway feature is pretty limited (and it is in Beta after all).</p>
<p>What if you like Dialogflow for building your bot/IVR, but do not want to use Google's numbers or PSTN gateway service? There are a growing number of third-party alternatives looking to expand voicebot capabilities into business IVRs. Google <a href="https://cloud.google.com/blog/products/gcp/transforming-the-contact-center-with-ai">announced a number of telephony platform partners</a> that are interfacing with it as part of Google's Contact Center AI initiative. However, these programs are all in Beta with very limited access and appear to largely be aimed at existing large contact center customers of these companies. VoxImplant and SignalWire are two CPaaS options that started down this path much earlier and have products they can show publicly.</p>
<p>Alexey of VoxImplant gave a preview of their Dialogflow Connector in an <a href="https://cogint.ai/making-ivrs-not-suck-with-dialogflow-alexey-aylarov/">earlier post here</a>. A review of SignalWire was a logical next step, so here it is.</p>
<p>SignalWire is from the team behind <a href="https://freeswitch.org">FreeSWITCH</a>, the popular open source telephony platform. SignalWire is a CPaaS built on much of that software, tuned for the purpose. It only launched in late July and is currently in a closed Beta, but that should end within weeks. Developers can request a login at <a href="https://signalwire.com">signalwire.com</a>.</p>
<p>I provide a qualitative, less-technical assessment of SignalWire's Dialogflow connector in the first part below and a short walkthrough further down.</p>
<h1 id="signalwireassessment">SignalWire Assessment</h1>
<h2 id="thegood">The Good</h2>
<p>I see a few benefits in using SignalWire's connector vs. Dialogflow's Phone Gateway.</p>
<h3 id="multiplenumbers">Multiple numbers</h3>
<p>You can add as many phone numbers to your agent as you want. The Dialogflow Phone Gateway only lets you pick one per agent. Multiple numbers for the agent allows you to localize them for a geography. Running multiple numbers on the same endpoint is also critical for call tracking in campaigns where specific phone numbers are tied to ads.</p>
<h3 id="potentialsmsintegration">Potential SMS integration</h3>
<p>Dialogflow provides their Phone Gateway for voice calling and a number of options for SMS interaction. Surprisingly, there is no way to combine voice and SMS on the same number with any of their standard integration options. SignalWire does not have SMS-to-Dialogflow-chatbot functionality built in natively, but it is possible to build via their API interfaces (in what I would estimate takes a few hours if you know the Dialogflow APIs).</p>
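<p>The glue could look roughly like this hypothetical sketch. Note that <code>detect_intent</code> below is a stub standing in for the real Dialogflow API call (e.g. via Google's Dialogflow client library), and all the function names are illustrative, not SignalWire or Google APIs:</p>

```python
# Hypothetical sketch of SMS-to-Dialogflow glue: route an inbound SMS webhook
# to Dialogflow and reply with the bot's text. detect_intent is a stub for the
# real Dialogflow detectIntent call; names here are illustrative, not real APIs.
import hashlib

def session_for(phone_number: str) -> str:
    """Derive a stable Dialogflow session id per caller so context persists across texts."""
    return hashlib.sha1(phone_number.encode()).hexdigest()[:16]

def detect_intent(session_id: str, text: str) -> str:
    """Stub: the real version would send `text` to Dialogflow and return its fulfillment text."""
    return f"[bot reply to '{text}' in session {session_id}]"

def handle_inbound_sms(from_number: str, body: str) -> str:
    """Webhook handler: same number -> same session, so SMS and voice could share one bot."""
    return detect_intent(session_for(from_number), body)

reply = handle_inbound_sms("+15551234567", "What are your hours?")
```

Keying the session on the phone number is the piece that would let a voice call and an SMS thread from the same caller share conversational context.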
<h3 id="pricing">Pricing</h3>
<p>Currently SignalWire only charges $0.005 per minute for the call and their phone numbers are currently free. If you want a permanent number that doesn't expire after 30 days, Dialogflow charges you $0.05 per minute. The reality is SignalWire is not charging users for their Dialogflow fees. I do not see how this is sustainable, but the SignalWire team assures me they will be competitive with Dialogflow's Phone Gateway rate. In any case, you can't beat the price for now - ⅒ of the Dialogflow <em>Enterprise Essentials</em> plan price.</p>
<h3 id="telephonylogs">Telephony logs</h3>
<p>Dialogflow's GUI lets you see intent history and they have the ability to export their logs to Stackdriver on Google Cloud or <a href="https://chatbase.com">Chatbase</a>. I have not found an easy way to just see sessions that came in over the phone or to filter them by phone number. SignalWire gives you logging by phone number with a history of the interaction. Unfortunately that interaction does not include the Dialogflow responses. Still, the provided speech detection inputs are helpful for debugging and identifying problem callers.</p>
<h3 id="nogooglevoiceintegrationissues">No Google Voice integration issues</h3>
<p>This is not a problem for most, but it's a huge one for me. Dialogflow doesn't work when you call it from a Google Voice number. I don't have good reception in my office and generally only use Google Voice, making testing of the Dialogflow Phone Gateway difficult. It's nice to be able to call the agent number and have it work no matter how I dial.</p>
<h2 id="bugs">Bugs</h2>
<p>I ran into a few bugs that I reported. To be fair, SignalWire is in a closed Beta and I suspect most of these issues will be quickly addressed.</p>
<h3 id="agentvoice">Agent voice</h3>
<p>Dialogflow lets you choose from multiple Text-to-Speech (TTS) voices in their audio responses. Unfortunately this is limited to US-English if you have the phone gateway. SignalWire currently only uses the default US voice. So if you set your Dialogflow bot to something different than the default you won't hear that voice. The SignalWire team says a fix on this is imminent.</p>
<h3 id="sayawelcomemessagerepeatsonsilence">Say a welcome message repeats on silence</h3>
<p>The &quot;Say a welcome message&quot; field you set in the SignalWire agent setup automatically repeats when the user does not say anything for several seconds. This makes it more useful as a reminder prompt when the user goes silent. Dialogflow has some of those capabilities built in already, so I am not sure another set of prompts outside the bot framework will be helpful. Even at call start, when SignalWire says this message will be played, I think it is better to just let the Dialogflow Welcome intent handle the transition. Fortunately, you can just turn this off if you don't like it.</p>
<h3 id="unverifiedappwarning">Unverified app warning</h3>
<p>When authenticating with Dialogflow, it gives an unverified app warning as shown below:<br>
<img src="https://cogint.ai/content/images/2018/09/1.2-not-verified.png" alt="SignalWire's Dialogflow Connector"><br>
The SignalWire team says this is a recent bug that they are working with Google to fix.</p>
<h2 id="conclusion">Conclusion</h2>
<p>This was all very easy to set up. Then again, the Dialogflow Phone Gateway was easy too. So why bother with an external party like SignalWire? I think the most compelling reason to use SignalWire is the ability to influence its development. The team behind it does custom development, so if you really need something, you can pay them to build it. That's not an option with Google.</p>
<p>In addition, Google is not a CPaaS and is unlikely to go too deep down that path (thus partnering with some telephony platforms instead). Just being able to add some programmability to the process is also helpful vs. what Dialogflow has today. SignalWire's programmability on Dialogflow is pretty limited today, but it's not hard to see how this will change quickly given the way they have their system set up. The SignalWire team tells me they will be adding more logic commands for Dialogflow to let you invoke it like the other recording and conferencing parts of their offer. That would let you make a recording of the whole call or join an agent into the Dialogflow session.</p>
<p>Overall I think there is a lot of potential for this category. There are a lot of crappy IVRs with complex telephony environments. If you want to add a voicebot to these environments, some kind of RTC-bot gateway is required, and that is what we are starting to see.</p>
<h1 id="signalwirewalkthrough">SignalWire Walkthrough</h1>
<p>First you will need to set up an account. After a few confirmation emails and some <em>Space</em> and <em>Project</em> setup, you should come to a screen like this where you can select a Dialogflow agent:<br>
<img src="https://cogint.ai/content/images/2018/09/1.1-DialogFlow-Agents.png/resize?w=1200" alt="SignalWire's Dialogflow Connector"></p>
<p>If you click on the import you'll need to authenticate to your Dialogflow account. As mentioned above, you might get a &quot;This app isn't verified&quot; screen where you need to click on Advanced to proceed. Assuming you trust the SignalWire team up to this point, it's fine to proceed.</p>
<p>After that you should see the list of Agents:<br>
<img src="https://cogint.ai/content/images/2018/09/1.3-Import-Agents.png/resize?w=1200" alt="SignalWire's Dialogflow Connector"></p>
<p>Just select the ones you want to make available and then configure them in this screen:<br>
<img src="https://cogint.ai/content/images/2018/09/1.4-setup-phone-agent.png/resize?w=1200" alt="SignalWire's Dialogflow Connector"><br>
Here you have the option of saying a welcome message and identifying some text to feed to Dialogflow when the call starts. As mentioned earlier, the &quot;Say a Welcome Message&quot; is better used as a silence prompt if you use it at all. In my bot, &quot;Hello&quot; and similar phrases are tied to the <em>Default Welcome Intent</em> in Dialogflow. Adjust or turn off as needed for yours.</p>
<p>That's all you need to do to connect to Dialogflow; now we just need to get a phone number and assign it to the Dialogflow agent. That is also super simple - just go to <em>Phone Numbers</em> and enter an area code:<br>
<img src="https://cogint.ai/content/images/2018/09/2.1-new-phone-number.png/resize?w=1200" alt="SignalWire's Dialogflow Connector"><br>
Select the one you want to <em>Buy</em> - as mentioned above, it's free for now, so there is no actual buying.</p>
<p>Then you'll come to a new phone number setup screen. Again, no rocket science here - just fill in the descriptions, pick &quot;A DialogFlow Agent&quot; for handling incoming calls, and select the appropriate agent.<br>
<img src="https://cogint.ai/content/images/2018/09/2.32-new-phone-number-setup.png/resize?w=1200" alt="SignalWire's Dialogflow Connector"></p>
<p>Make a call to your number and verify it works. That's it!</p>
<hr>
<h3 id="abouttheauthor">About the Author</h3>
<p>Chad Hart is an analyst and consultant with <a href="https://cwh.consulting">cwh.consulting</a>, a product management, marketing, and strategy advisory helping to advance the communications industry. He recently co-authored a study on AI in RTC - check it out at <a href="https://krankygeek.com/research">krankygeek.com/research</a>.</p>
<hr>
<h3 id="remembertosubscribefornewpostnotificationsandfollowcogintai">Remember to <a href="https://cogint.ai/subscribe/">subscribe</a> for new post notifications and <a href="https://twitter.com/cogintai">follow @cogintai</a>.</h3>
</div>]]></content:encoded></item><item><title><![CDATA[How to make a dial-in IVR with Dialogflow]]></title><description><![CDATA[A review of Dialogflow's new phone gateway and knowledge connector with a walkthrough on how to use them to build a conversational voicebot IVR in minutes.]]></description><link>https://cogint.ai/dialogflow-phone-bot/</link><guid isPermaLink="false">5b85af1004a8c82a4f530d32</guid><category><![CDATA[voicebot]]></category><category><![CDATA[dialogflow]]></category><dc:creator><![CDATA[Chad Hart]]></dc:creator><pubDate>Thu, 30 Aug 2018 11:21:52 GMT</pubDate><media:content url="https://cogint.ai/content/images/2018/08/header-image-1.png" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://cogint.ai/content/images/2018/08/header-image-1.png" alt="How to make a dial-in IVR with Dialogflow"><p>Continuing on my quest to see all clunky, hierarchical IVRs die, I have been experimenting with Dialogflow's new <em>Phone Gateway</em> and <em>Knowledge Connectors</em> that were <a href="https://developers.googleblog.com/2018/07/new-dialogflow-features-at-cloud-next.html">launched in late June</a> at CloudNext. The <em>Knowledge Connector</em> extracts data out of a Frequently Asked Questions (FAQ) document and makes it available for the bot to use in its responses. The <em>Phone Gateway</em> includes lets you connect any Dialogflow bot to a phone line. Together, these two features make it relatively simple to build a voicebot replacement to a informational IVR with the added bonus that the bot can be reused on other media too.</p>
<p>Read below for:</p>
<ul>
<li>some background on Dialogflow and the IVR use case,</li>
<li>a detailed walkthrough of how to do it in minutes, and</li>
<li>some of my learnings and potential next steps.</li>
</ul>
<h1 id="conceptsandingredients">Concepts and Ingredients</h1>
<p>Why is replacing a traditional IVR with a voicebot better? What are these new Dialogflow features exactly? Let's review some of the core principles first.</p>
<h2 id="menusvsintents">Menus vs. Intents</h2>
<p>Traditional IVR systems always use menus because that is the only way they could navigate through a bunch of explicit options. In this approach, the number of items in a given menu was limited by two things:</p>
<ol>
<li>The number of DTMF keys on the phone</li>
<li>User patience to listen through a list of options and then make a selection</li>
</ol>
<p>Voice recognition - where users can speak their options - is supplementing DTMF entry. However, that technology isn't everywhere, and the main challenge still remains: how many menu items can a user remember before they forget and have to go back through the menu? This means there is pressure to keep each menu short. To fit all the options needed under this constraint, IVRs must have a multi-layered menu hierarchy.</p>
<p>Imagine if people had conversations like this. When someone calls a receptionist at a business, the receptionist doesn't spout out a menu; they ask &quot;how can I help you?&quot; and then respond to the answer appropriately. Intent-based Natural Language Understanding (NLU) systems let voicebots interact like a real person would. They take what the user says - their <em>utterance</em> - and match it to an <em>intent</em> which indicates a response. Conversational bot systems are actually much more dynamic than this with follow-on intents, response variations, and other options for slot-filling, but the core concept is the same.</p>
<p><img src="https://cogint.ai/content/images/2018/08/ivr-menu-vs-flat-voicebot-side-by-side.png" alt="How to make a dial-in IVR with Dialogflow"></p>
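<p>Conceptually, intent matching scores the user's utterance against each intent's training phrases and picks the best fit. Here is a toy sketch - crude keyword overlap standing in for real NLU like Dialogflow's, with invented intents and phrases:</p>

```python
import re

# Toy intent matcher - crude keyword overlap as a stand-in for real NLU like
# Dialogflow's. The intents and training phrases are invented for illustration.
intents = {
    "hours":    ["what are your hours", "when are you open", "opening hours"],
    "location": ["where are you located", "what is your address", "directions"],
}

def match_intent(utterance: str) -> str:
    """Return the intent whose training phrases share the most words with the utterance."""
    words = set(re.findall(r"[a-z']+", utterance.lower()))
    def best_overlap(phrases):
        return max(len(words & set(p.split())) for p in phrases)
    return max(intents, key=lambda name: best_overlap(intents[name]))

print(match_intent("When do you open?"))      # hours
print(match_intent("What is your address?"))  # location
```

A real NLU system generalizes far beyond literal word overlap, but the flow is the same: utterance in, best-matching intent out, response selected from that intent.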
<h2 id="dialogflow">Dialogflow</h2>
<p>As it competes with Amazon's Alexa, Google has been launching a bunch of new features for <a href="https://dialogflow.com/">Dialogflow</a>, its natural language processing and bot builder tool. Unlike <a href="https://developers.google.com/actions/">Actions on Google</a>, which only lets you create bots for the Google Assistant, Dialogflow has integrations with a wide variety of platforms including Alexa, Cortana, many popular chat platforms, and now the telephone network.</p>
<h2 id="informationalinteractivevoiceresponsesystems">Informational Interactive Voice Response Systems</h2>
<p>Interactive Voice Response (IVR) systems have become the bane of telecoms, but they were once a convenience for users. They help direct callers and reduce time spent dealing with multiple transfers and clueless operators. The primary functions of an IVR come down to:</p>
<ol>
<li>Directing callers to the right person</li>
<li>Offloading the agent (the person answering the phone) by providing or collecting information before the agent gets on</li>
</ol>
<p>Dialogflow can help with both of these, but here I am going to focus on the offloading part. Large businesses like banks have sophisticated systems that let you make self-serve transactions (today as an alternative to doing that on the web or in an app). Many smaller organizations still have IVRs that provide basic information - opening hours, addresses, etc. Even the more sophisticated systems usually have pieces of their IVR that do the same thing. For this post, I am going to focus on these &quot;Informational&quot; elements that can be provided by an IVR.</p>
<h2 id="dialogflowknowledgeconnector">Dialogflow Knowledge Connector</h2>
<p>Web-based Frequently Asked Questions (FAQs) have become a popular content format on the web, mainly because of the way Google search provides <a href="https://support.google.com/webmasters/answer/6229325">featured snippets</a>. FAQs are usually easy to write and contain a lot of good information. The problem is FAQs are often not displayed prominently on a website, existing primarily for the search engines. In addition, having them didn't help if someone called in on the phone - until now.</p>
<p>The concept of taking FAQ content and adding it to a bot is straightforward, but the work of populating that data is tedious. Microsoft has a tool called <a href="https://www.qnamaker.ai/">QnA Maker</a> that helps automatically parse Q&amp;A-style documents to populate a bot. Just point it at a webpage or document and it uses NLP technology to suck in the data.</p>
<p>This summer Google released something similar in Dialogflow called <em>Knowledge Connectors</em>.</p>
<h2 id="phonegateway">Phone Gateway</h2>
<p>The other big piece here that Google recently launched is the Phone Gateway. I have talked a few times <a href="https://cogint.ai/tag/voicebot/">on this blog</a> about voicebot gateways from vendors like <a href="https://voximplant.com/solutions/dialogflow-connector">VoxImplant</a> and <a href="https://signalwire.com">Signalwire</a> that added a telephony gateway to Dialogflow. Now, much like <a href="https://www.ibm.com/us-en/marketplace/voice-gateway">IBM</a>, Google has added its own Phone Gateway.</p>
<p>Google provides access to numbers directly from within the Dialogflow interface. If all you care about is connecting a phone number to Dialogflow, I don't think anyone could offer a simpler way to do it. Better yet, if you just need a number for testing, Google gives them away for free for 30 days.</p>
<p>If you want a permanent number they sell those too - you'll just need to sign up for their Enterprise plan, which will increase your bot price from totally free to $0.05-$0.075 per minute depending on <a href="https://cloud.google.com/dialogflow-enterprise/pricing">the plan and number type</a>.</p>
<p>Presumably their Contact Center AI <a href="https://cloud.google.com/blog/products/gcp/transforming-the-contact-center-with-ai">telephony partners</a> provide something similar to their customers from their own interfaces.</p>
<h1 id="howtoguide">How To Guide</h1>
<p>Here is a step-by-step guide on how to:</p>
<ul>
<li>create your agent,</li>
<li>add the phone gateway integration,</li>
<li>import a number of FAQs with the knowledge connector,</li>
<li>make the bot more conversational,</li>
<li>transfer calls to a human.</li>
</ul>
<p>Dialogflow has many features and I am not going to go into most of them. Also, this is just a quick and dirty demonstration, so remember there is a lot of tweaking and improvement you would want to make in a real bot.</p>
<h2 id="getanaccountcreateanagent">Get an account, create an agent</h2>
<p>If you don't have a Dialogflow account already, go to <a href="https://dialogflow.com">dialogflow.com</a>, sign-in with a Google account, and pick your account settings.</p>
<p>If it is a new account, you will be forced to create an agent:<br>
<img src="https://cogint.ai/content/images/2018/08/2---Create-an-Agent.png/resize?w=1200" alt="How to make a dial-in IVR with Dialogflow"></p>
<p>For reasons I will explain later, I decided to make a Star Wars movie hotline.</p>
<h2 id="edityourdefaultintents">Edit your default intents</h2>
<p>Next you will come to an Intents screen (or click on Intents on the left).<br>
<img src="https://cogint.ai/content/images/2018/08/3---Default-Intents.png/resize" alt="How to make a dial-in IVR with Dialogflow"></p>
<p>The <em>Default Welcome Intent</em> is the first thing your bot will say. The <em>Default Fallback Intent</em> is what the bot says if it does not understand the input.</p>
<p>Click on <em>Default Welcome Intent</em>. You'll see a bunch of training phrases and then you will see a bunch of Default text responses. I modified mine to respond with some statements prompting the user to ask about Star Wars:<br>
<img src="https://cogint.ai/content/images/2018/08/4.1---Default-Welcome-Intent.png/resize?w=1200" alt="How to make a dial-in IVR with Dialogflow"></p>
<p>Click on <em>Save</em> and you should see some messages saying <em>Agent Training Started</em> and a completion message soon after. Once that's complete, you can do a quick test by typing in your query in the upper right. It should return one of your responses. Note how you can enter nearly any greeting and it will respond. If there is a non-standard greeting it does not pick up, just add that to the <em>Training Phrases</em>.</p>
<h2 id="setupthephonegateway">Setup the Phone Gateway</h2>
<p>Before we go any further let's set up our Phone Gateway and make sure we can call into our bot. Click on the sandwich menu icon in the upper left and click on <em>Integrations</em>. There are a lot of them here, but since the scope of this walk-through is an IVR voicebot, pick the <em>Dialogflow Phone Gateway</em> icon. Then you'll get a screen like this:<br>
<img src="https://cogint.ai/content/images/2018/08/5---Phone-Gateway-Setup.png/resize" alt="How to make a dial-in IVR with Dialogflow"></p>
<p>This is where you can pick a phone number and other options. The menu only lets me pick English, but I assume other language options will be available over time. Optionally you can enter some area codes that you want your phone number to start with. Pick a phone number, click create, wait for the confirmation and then close out of there.</p>
<p>That's it - now you have a phone number you can use for 30 days. Per the <em>Note</em> in the screen above, if you want a number for longer than that then you need to sign up for the Enterprise Edition.</p>
<p>Give your number a call and you should hear your welcome intent. It should respond to greetings, but not much else, which isn't very useful so let's move to the next phase.</p>
<h2 id="knowledgeconnectors">Knowledge connectors</h2>
<p>Usually programming a bot involves manually creating a bunch of intents and responses like we saw on the <em>Default Welcome Intent</em> screen. Dialogflow has some <em>Prebuilt Agents</em> you can modify if they happen to fit your use case. This saves some time, but it is possible to skip the intent entry step entirely using <em>Knowledge Connectors</em>. As explained earlier, <em>Knowledge Connectors</em> let you populate the intents and responses automatically from a Frequently Asked Questions (FAQ) document.</p>
<h3 id="enablebetafeatures">Enable Beta Features</h3>
<p>Google's Knowledge Connector is in Beta, so we will need to enable the feature. To do this, click on the gear icon next to your bot name in the upper left, then click the slider for <em>Enable beta features and APIs</em> and click save:<br>
<img src="https://cogint.ai/content/images/2018/08/6---enable-beta-features.png/resize" alt="How to make a dial-in IVR with Dialogflow"></p>
<h3 id="addknowledgebaseurls">Add Knowledge Base URLs</h3>
<p>Now head back to the left menu and pick <em>Knowledge</em> and then <em>Create Knowledge Base</em>.</p>
<p>Then you'll need to pick a name for your knowledge set.</p>
<p>After that you will see a link to create a knowledge document.</p>
<p>From there you will need to fill in the form with a document name, <em>FAQ</em> for the knowledge type and <em>text/html</em> if you are using a URL, followed by the URL itself:<br>
<img src="https://cogint.ai/content/images/2018/08/7---Knowledge-Create-New-Document.png/resize" alt="How to make a dial-in IVR with Dialogflow"></p>
<p>Click <em>Create</em> and a few moments later you should see the document in the list.<br>
<img src="https://cogint.ai/content/images/2018/08/8---Knowledge-document-created.png/resize" alt="How to make a dial-in IVR with Dialogflow"></p>
<p>If you don't see the document, jump down to my issues section below. If you do see the document, you should be able to open it and see the full Q&amp;A. That screen will also let you disable individual Q&amp;A pairs.</p>
<p>I went back and added a bunch more FAQ documents to round out my knowledge set. Make sure to hit save when you're done.</p>
<p>Give your number a call and ask it some questions related to the content in your FAQs and you should get answers. I found I had to turn up the preference to return Knowledge results in the main <em>Knowledge</em> tab for better results:<br>
<img src="https://cogint.ai/content/images/2018/08/9---Knowledge-results-preference.png/resize" alt="How to make a dial-in IVR with Dialogflow"></p>
<p>Once you start adding other intents this will need to be tested and adjusted more.</p>
<h2 id="addsmalltalk">Add Small Talk</h2>
<p>Without any real data entry, your bot should be in decent shape, but it is hardly conversational outside of the narrow FAQ documents you gave it. Dialogflow has another feature called <em>Small Talk</em> that provides default responses to popular filler discourse, like &quot;thank you&quot;.</p>
<p>Go over to <em>Small Talk</em> in the menu and enable it.</p>
<h2 id="realintents">Real intents</h2>
<p>We made it pretty far without doing anything complicated, but our bot is missing a few features you would expect:</p>
<ol>
<li>The ability to hang up</li>
<li>The option to talk to a human</li>
</ol>
<h3 id="hangupintent">Hang-up intent</h3>
<p>We never set up a way to end the conversation. The user can simply hang up, but the usual protocol is to exchange some kind of farewell first. This is easy to add and is included in most of the <em>Prebuilt Agents</em> if you started that way.</p>
<p>Go to <em>Intents</em>. Create a new intent, call it something like &quot;Hangup&quot;, and add some farewell training phrases. Then customize your Text response and make sure to check <em>Set this intent as end of conversation</em>.<br>
<img src="https://cogint.ai/content/images/2018/08/10---Hang-up-intent.png/resize?w=1200" alt="How to make a dial-in IVR with Dialogflow"></p>
<p>Make sure to do some tests. I found phrases like &quot;I'm finished&quot; matched some of my knowledge base articles. Adjust the Knowledge results preference slider if you need to.</p>
<h3 id="transfertooperatorintent">Transfer to operator intent</h3>
<p>To make this work cleanly we are going to add a new intent so the user can signal they want to talk to a human. Then we will make a follow-up intent to confirm and transfer the call. Technically this could be done in one step, but since this was meant to be a FAQ bot we don't want users to be transferred accidentally.</p>
<p>Click on the plus sign next to <em>Intents</em> and create an intent called Operator. Add some &quot;speak to a person&quot; training phrases and add a confirmation prompt as a response.<br>
<img src="https://cogint.ai/content/images/2018/08/11---Operator-intent.png/resize?w=1200" alt="How to make a dial-in IVR with Dialogflow"></p>
<p>As always, make sure to save, let it train, and then test.<br>
Now go back to the <em>Intents</em> screen, hover over the operator intent you just made and create a follow-up intent:<br>
<img src="https://cogint.ai/content/images/2018/08/12---Operator-follow-up.png/resize" alt="How to make a dial-in IVR with Dialogflow"></p>
<p>Select &quot;yes&quot; from the drop-down menu - this will auto-populate the intent with confirmation training phrases.</p>
<h3 id="setittodialout">Set it to dial out</h3>
<p>Lastly, go to the <em>Responses</em> section, click the &quot;+&quot; and select Telephony.</p>
<p>Click on &quot;ADD RESPONSES&quot; and select &quot;Transfer call&quot;. Then just put your phone number in the box.<br>
<img src="https://cogint.ai/content/images/2018/08/13---Transfer-call---redacted.jpg/resize" alt="How to make a dial-in IVR with Dialogflow"><br>
Give your voicebot a call and make sure it works.</p>
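<p>For reference, the GUI stores that transfer as a Telephony response message on the intent. If you manage intents through the API instead, the message looks something like the following - the field names reflect my reading of the v2beta1 API and may have changed, and the phone number is a placeholder:</p>

```python
def telephony_transfer_message(phone_number):
    """Build an intent response message that transfers the caller.

    Field names follow my reading of the Dialogflow v2beta1 API --
    treat this as a sketch, not gospel. The number must be a US
    number in E.164 format.
    """
    return {
        "platform": "TELEPHONY",
        "telephonyTransferCall": {"phoneNumber": phone_number},
    }

msg = telephony_transfer_message("+16175551234")  # placeholder number
print(msg)
```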
<h2 id="tipstricksandissues">Tips, Tricks, and Issues</h2>
<p>It is important to remember that the <em>Dialogflow Phone Gateway</em> and <em>Knowledge Connector</em> are in Beta. Google's &quot;beta&quot; usually just means &quot;new&quot;, but I encountered enough bugs and glitches to see that this one really is in Beta.</p>
<h3 id="dontusegooglevoice">Don't use Google Voice</h3>
<p>Don't use a Google Voice number with the <em>Phone Gateway</em> - it doesn't work.</p>
<blockquote>
<p>Note: Calling the gateway from a device using Project Fi, Google Voice, or Google Hangouts is not currently supported.</p>
</blockquote>
<p>This was painful for me since I do not have cellular reception in my office and generally use my Google Voice number to dial over VoIP. Ironically I ended up using another bot - Alexa on my Amazon Echo - to place test PSTN calls to the voicebot.</p>
<h3 id="parsingfaqdocuments">Parsing FAQ documents</h3>
<p>The Knowledge Connector was super finicky for me. I originally wanted to build the voicebot on the <a href="https://www.krankygeek.com/ai-in-rtc-faq/">AI in RTC report FAQ</a> but I could not get it to parse the document. I kept getting &quot;failed to crawl&quot; errors despite several attempts to restructure the document.</p>
<p>That FAQ was built in WordPress. Other FAQs I made in WordPress in the same manner parsed just fine, so I am not sure what the problem was.<br>
<img src="https://cogint.ai/content/images/2018/08/Failed-to-crawl.png/resize" alt="How to make a dial-in IVR with Dialogflow"></p>
<p>I recommend just using a CSV file as I'll discuss below.</p>
<h3 id="nointerruptingplayback">No interrupting playback</h3>
<p>A lot of the answers I got from the IMDb movie FAQs were very long. Unfortunately, I was not able to find a good way to &quot;Cancel&quot; in the middle of playback. On the Google Assistant you can always say &quot;OK Google&quot; to interrupt the response playback, but I did not see a way to implement that mechanism on the Phone Gateway.</p>
<p>Also, for the really long ones, occasionally the intent recognizer started up again before the previous text-to-speech response even finished. I suspect this is a bug.</p>
<h3 id="keepfaqresponsesshort">Keep FAQ responses short</h3>
<p>In most cases users don't want to hear a bot drone on, especially when they can't easily cancel. It is supposed to be a dialog after all, not a lecture. For this reason I recommend keeping the FAQ responses to a few sentences vs. a few paragraphs.</p>
<p>Unfortunately there is no easy way to break up long responses so the user can choose whether to continue. Dialogflow does let you easily convert a FAQ question into an intent, so you could use that feature to break the text into a series of follow-up intents, but that involves a lot more work and assumes your text is already structured in a way that makes sense to interrupt in the middle with &quot;would you like to hear more?&quot;. Alternatively, you could interface directly with Dialogflow's APIs and webhooks for more control, but that requires a developer.</p>
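<p>If you do go the webhook route, the splitting logic itself is straightforward. Here is a hypothetical helper (the function name and prompt wording are mine, not Dialogflow's) that chops an answer into short sentence-bounded chunks, each ending with a continuation prompt except the last:</p>

```python
import re

def split_answer(answer, max_sentences=2,
                 prompt="Would you like to hear more?"):
    """Split a long FAQ answer into short spoken chunks.

    Each chunk except the last ends with a continuation prompt so
    the bot can wait for a yes/no before droning on.
    """
    sentences = re.split(r"(?<=[.!?])\s+", answer.strip())
    chunks = []
    for i in range(0, len(sentences), max_sentences):
        chunk = " ".join(sentences[i:i + max_sentences])
        if i + max_sentences < len(sentences):
            chunk += " " + prompt
        chunks.append(chunk)
    return chunks

print(split_answer("One. Two. Three. Four. Five."))
```

<p>Your fulfillment would then return one chunk at a time, serving the next chunk whenever the user confirms.</p>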
<h3 id="noeditingfaqsindialogflowsousecsvs">No editing FAQs in Dialogflow, so use CSVs</h3>
<p>Unlike Microsoft's QnA Maker, there is no way to edit or add FAQ items within the GUI after they are imported. Given all the hassle I had parsing HTML pages, I think it's easier to scrape the page content into a CSV where you can edit the text outside of Dialogflow and reimport. That also gives better control over content versioning.</p>
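<p>Generating that CSV takes only a few lines. A minimal sketch - the two-column question-then-answer layout with no header row is what worked for me, so double-check the current Knowledge Connector docs before relying on it:</p>

```python
import csv

faqs = [
    ("What is this bot?", "A FAQ voicebot built with Dialogflow."),
    ("How do I reach a person?", "Say 'operator' at any time."),
]

# Two columns per row -- question, then answer -- with no header row
# (the layout the Knowledge Connector accepted for me).
with open("faq.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(faqs)
```

<p>Keeping the CSV in version control gives you the edit history the Dialogflow console lacks.</p>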
<h3 id="sometimestheinterfaceisslowbepatient">Sometimes the interface is slow, be patient</h3>
<p>I found that in a bunch of cases changes wouldn't stick; sometimes the interface is just slow. Make sure to wait for the &quot;Agent training completed&quot; toast message before testing something new.</p>
<p>Other times I would get random 500 errors. They eventually went away on their own.<br>
<img src="https://cogint.ai/content/images/2018/08/500-error.png/resize" alt="How to make a dial-in IVR with Dialogflow"></p>
<h2 id="conclusions">Conclusions</h2>
<p>I tried creating another bot for the <a href="https://www.krankygeek.com/ai-in-rtc-faq">AI in RTC report FAQ</a>. Since I still could not import from the URL, it took me about 10 minutes to copy and reformat the contents into a CSV file. From there it took me about 25 minutes to walk through these steps, including testing, to make it work. It is very far from perfect, but not too bad as a base to get started in a short period of time.</p>
<p>Give it a try at <a href="tel:16173807150">+1 617-380-7150</a> and see how it works for you. In the meantime I will think about turning this into a proper voicebot!</p>
<hr>
<p>Chad Hart is an analyst and consultant with <a href="https://cwh.consulting">cwh.consulting</a>, a product management, marketing, and strategy advisory helping to advance the communications industry. He recently co-authored a study on AI in RTC - check it out at <a href="https://krankygeek.com/research">krankygeek.com/research</a>.</p>
<hr>
<h3 id="remembertosubscribefornewpostnotificationsandfollowcogintai">Remember to <a href="https://cogint.ai/subscribe/">subscribe</a> for new post notifications and <a href="https://twitter.com/cogintai">follow @cogintai</a>.</h3>
</div>]]></content:encoded></item><item><title><![CDATA[AI in RTC Report Highlights: Speech Analytics & Voicebots show the most promise]]></title><description><![CDATA[Highlights and key findings from the recently released Kranky Geek Research AI in RTC study. Speech Analytics & Voicebots show the most promise. Computer vision and RTC optimization lag behind in adoption. ]]></description><link>https://cogint.ai/ai-in-rtc-report-highlights/</link><guid isPermaLink="false">5b7cd6ab04a8c82a4f530d23</guid><category><![CDATA[analysis]]></category><category><![CDATA[speech analytics]]></category><category><![CDATA[computer vision]]></category><category><![CDATA[voicebot]]></category><dc:creator><![CDATA[Chad Hart]]></dc:creator><pubDate>Wed, 22 Aug 2018 13:26:52 GMT</pubDate><media:content url="https://cogint.ai/content/images/2018/08/intro-picture.jpg" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://cogint.ai/content/images/2018/08/intro-picture.jpg" alt="AI in RTC Report Highlights: Speech Analytics & Voicebots show the most promise"><p>We hear it all the time - Machine Learning (ML) and its use in creating Artificial Intelligence (AI) applications is having a profound impact on many industries. <a href="https://bloggeek.me/">Tsahi Levent-Levi</a>, blogger and my fellow co-organizer of the Kranky Geek event series, teamed up with me to do a deep study on the application of Artificial Intelligence in Real Time Communications (AI in RTC). After a couple months of research, many dozens of conversations, and analysis of hundreds of products, we put our findings into a <a href="https://krankygeek.com/research">147-page report</a>. See below for more on the study and some of my take-aways.</p>
<h2 id="aboutthestudy">About the Study</h2>
<p>From the outset, we decided to focus on four domains:</p>
<ul>
<li><strong>Speech Analytics</strong> – converting speech to text (STT) (aka ASR or transcription) and analyzing the waveform and converted text. I had done a lot of work in this area launching a speech analytics service and was familiar with the new, burgeoning ecosystem.</li>
<li><strong>Voicebots</strong> – automated programs that interact with users in a conversational dialog using speech as input and output, like many IVR systems and Amazon’s Alexa. Like almost everyone, I hate IVRs, so I was very interested to see how advancements in this tech could make IVRs suck less. I had also previously investigated the use of voicebots in conference calls to help control the bridge and wanted to explore this and other use cases.</li>
<li><strong>Computer Vision (CV)</strong> – programs that analyze and understand images and video. I had done a lot of my own <a href="https://webrtchacks.com/tag/computer-vision/">experimentation here on webrtcHacks</a>. Other than some overlay features in social media apps and some virtual name tag demos from Cisco and Microsoft, I wanted to understand why video conferencing providers weren’t doing more with CV.</li>
<li><strong>RTC Optimization</strong> – machine learning methods used to improve VoIP media quality or cost performance. I knew the least about this area but given the recent focus on controlling bandwidth and error correction by RTC stack designers, I expected we would find some interesting research on this topic.</li>
</ul>
<p>The only other area outside of these that our research uncovered was the use of machine learning for route optimization in call centers – i.e. determining which agent to pass a given customer to based on the problem, agent expertise, user history, etc. We decided not to investigate it this time around because, while it can be RTC-related, it does not involve processing the media stream and is a more general problem for all customer interactions.</p>
<h3 id="methodology">Methodology</h3>
<p>In addition to our own personal experience and on-going work, we conducted significant primary and secondary research. Our main source of information was company interviews. We identified more than 100 target companies that included:</p>
<ul>
<li><strong>RTC companies</strong> - telcos, CPaaS providers, cloud-based Unified Communications and call center providers</li>
<li><strong>AI companies</strong> – speech analytics, voicebot, computer vision, and other ML-technology vendors</li>
</ul>
<p>In the end we interviewed about 40 of these companies and did deep reviews of the others. To supplement these interviews, we also conducted a web survey where we had 96 unique companies of all varieties respond.</p>
<p><img src="https://cogint.ai/content/images/2018/08/top-ai-drivers-and-inhibitors.jpg" alt="AI in RTC Report Highlights: Speech Analytics & Voicebots show the most promise"></p>
<h2 id="findings">Findings</h2>
<p>These interviews, the subsequent analysis, and writing the 147-page report were a major time commitment, so unfortunately we can’t give everything away for free, but here are some of my take-aways:</p>
<ul>
<li><strong>Speech analytics is where all the action is</strong> – the majority of the companies we covered had some speech analytics initiative. I was not surprised this was such a popular area based on my recent professional work, but I discovered many new vendors and re-discovered some who have been in this domain for a while.</li>
<li><strong>Voicebots are the next big AI in RTC domain</strong> - the area has perhaps the most immediate potential for RTC apps as voicebot technology from the major cloud vendors is being commoditized and surpassing traditional conversational IVR implementations. The hard part here is integrating conversational AI tech with established telephony environments – I covered that in <a href="https://cogint.ai/vocebot-ai-in-the-call-center/">my last cogint.ai post</a>.</li>
<li><strong>Computer vision hasn’t received much attention</strong> – outside of social media apps, video RTC companies are just starting to look at it and CV-tech companies have been focused on other markets, like <a href="https://bloggeek.me/autonomous-cars-killing-rtc-ai/">autonomous cars</a>.</li>
<li><strong>Only big cloud vendors do everything</strong> - outside of the major cloud vendors – Amazon, Google, Microsoft, and IBM – the core machine learning technology vendors were only focused on one domain.</li>
<li><strong>No one is using ML in RTC stacks</strong> - with only a couple exceptions – see <a href="https://2hz.ai/">2Hz</a> and some <a href="https://hacks.mozilla.org/2017/09/rnnoise-deep-learning-noise-suppression/">Mozilla research</a> – hardly anyone is leveraging ML to improve their low-level VoIP mechanics.</li>
<li><strong>Lack of ML expertise is an issue for RTC companies</strong> – lack of staff who know and can apply ML was cited as the number 1 inhibitor in our survey.</li>
<li><strong>Promise &amp; peril from partnering with big AI cloud vendors</strong> – the big cloud vendors are also big AI vendors. They are proving the value of AI technologies and democratizing ML tools, meaning it will only get easier for RTC companies to work with them. At the same time Amazon, Google, and Microsoft also offer their own communications products and services. RTC companies without their own ML expertise are in a difficult situation of relying on technology from companies that are growing increasingly competitive.</li>
</ul>
<p>Overall there are more use cases for AI in RTC than I imagined. At the same time, very little effort has been spent exploring most of these use cases. In addition, most RTC companies are nowhere near the head of the curve when it comes to ML, so don’t expect any immediate general market shifts. On the other hand, AI technologies are easier than ever to find in open source repositories or to purchase. Even if there isn’t always an easy path ahead, the potential here is extremely exciting. Based on our conversations and analysis, I expect AI in RTC will only become a bigger topic with a growing variety of implementations and use cases.</p>
<h2 id="moreinformation">More information</h2>
<p>You can see the full table of contents, list of figures, download a report preview over at <a href="https://krankygeek.com/research">krankygeek.com/research</a>. For those with some budget for market research, we are offering a publication launch discount until September 7. <a href="https://chadwallacehart.com/contact/">Ask me</a> for details or <a href="https://krankygeek.com/research">visit the report site</a> to purchase.</p>
<p><a href="https://krankygeek.com/research"><img src="https://cogint.ai/content/images/2018/08/201808-ai-rtc-cover.jpg" alt="AI in RTC Report Highlights: Speech Analytics & Voicebots show the most promise"></a></p>
<hr>
<p>Chad Hart is an analyst and consultant with <a href="https://cwh.consulting">cwh.consulting</a>, a product management, marketing, and strategy advisory helping to advance the communications industry.</p>
<hr>
<h3 id="remembertosubscribeandfollowus">Remember to <a href="https://cogint.ai/subscribe/">Subscribe</a> and <a href="https://twitter.com/cogintai">Follow</a> us.</h3>
</div>]]></content:encoded></item><item><title><![CDATA[When will Voice AI Replace the Call Center Voice Channel?]]></title><description><![CDATA[Consumer voicebot technology is becoming commonplace, so how long until this tech pervades the call center? We look at the status of telephony-oriented voicebot technology and when it might arrive.]]></description><link>https://cogint.ai/vocebot-ai-in-the-call-center/</link><guid isPermaLink="false">5b3aded304a8c82a4f530d11</guid><category><![CDATA[voicebot]]></category><category><![CDATA[analysis]]></category><dc:creator><![CDATA[Chad Hart]]></dc:creator><pubDate>Tue, 03 Jul 2018 13:20:49 GMT</pubDate><media:content url="https://cogint.ai/content/images/2018/07/shutterstock_795529078.jpg" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://cogint.ai/content/images/2018/07/shutterstock_795529078.jpg" alt="When will Voice AI Replace the Call Center Voice Channel?"><p>Google’s Duplex demo a few months ago forced us to think about a world where we might get a random call from a business and not realize it is a robot making the call. Recent <a href="https://arstechnica.com/gadgets/2018/06/google-duplex-is-calling-we-talk-to-the-revolutionary-but-limited-phone-ai/">reports</a> have verified that this technology is not just some staged demo and actually works. While there are certainly many sticky implications of this from a consumer perspective, voicebots are already an interesting channel for contact centers. Text-based chatbots have been consuming more and more customer interaction-minutes. With the emergence of improved voice technology, will customer-to-agent voice interactions decline further in favor of this new kind of voice channel?</p>
<h2 id="voicebot20">Voicebot 2.0</h2>
<p>Voicebots are not really new technology. High-end <a href="https://en.wikipedia.org/wiki/Interactive_voice_response">IVR</a> systems have been able to handle basic voice interactions for years. However, these systems were expensive and therefore tended to be used only in large, well-funded call centers. More recently new machine learning approaches have fueled some major advancements in bot and speech technology which has in turn created a new, rapidly growing ecosystem.</p>
<p>A large part of voicebot tech is actually derived from text-based chatbots. The main advancement here has been around <a href="https://en.wikipedia.org/wiki/Natural_language_understanding">Natural Language Understanding (NLU)</a> - capabilities that let software match the vagaries of human speech to specific actions while extracting the right objects and figures from speech. The methodologies for programming bots are starting to converge, helping to spread best practices and reduce the learning curve. Platforms like Google’s Dialogflow (formerly api.ai), Wit.ai (owned by Facebook), Microsoft’s LUIS, IBM’s Watson, and Amazon’s Lex, along with a few yet-to-be-acquired start-ups and some open source projects, provide relatively easy ways to build bots.</p>
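<p>To make the NLU idea concrete, here is a deliberately naive toy matcher - keyword scoring with a hand-written entity list, nothing like the statistical models the real platforms use - that maps an utterance to an intent and pulls out a parameter:</p>

```python
import re

# Toy training data: intent name -> keywords that signal it.
INTENTS = {
    "book_flight": {"fly", "flight", "ticket"},
    "check_weather": {"weather", "rain", "forecast"},
}

CITIES = {"boston", "paris", "tokyo"}  # stand-in for an entity list

def understand(utterance):
    """Map an utterance to (intent, entities) by keyword overlap.

    Real NLU engines use models that generalize far beyond exact
    keyword hits -- this only illustrates the job they perform.
    """
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    intent = max(INTENTS, key=lambda name: len(words & INTENTS[name]))
    if not words & INTENTS[intent]:
        intent = None  # nothing matched at all
    entities = {"city": city for city in CITIES if city in words}
    return intent, entities

print(understand("Book me a flight to Boston"))
# -> ('book_flight', {'city': 'boston'})
```

<p>The hard part the platforms solve is doing this robustly for utterances they have never seen before.</p>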
<p>Second, speech recognition and speech synthesis technologies have significantly improved. New machine learning approaches continue to improve recognition accuracy and make computer speech sound more human. In addition, cloud services have started to offer machine learning-centric hardware as a service, which helps run these heavy, specialized loads efficiently at lower cost.</p>
<p><img src="https://cogint.ai/content/images/2018/07/watson-voice-agent-flow.png" alt="When will Voice AI Replace the Call Center Voice Channel?"></p>
<h6 id="examplevoicebotarchitecturefromawatsonvoiceagentsource">Example voicebot architecture from a Watson Voice Agent. <a href="https://voiceagentsbk.mybluemix.net/images/vgaas_image_2.png">Source</a>.</h6>
<p>Lastly, voice assistants and smart speakers – largely from Amazon and Google – have helped to bring bot and speech technologies together. More importantly, they have made a growing portion of the population comfortable interacting with machines via voice through massive exposure. Amazon effectively introduced the smart speaker category in late 2014. By the end of 2017 it had shipped more than 30 million devices, with Google gaining share behind it. <a href="http://www.pewresearch.org/fact-tank/2017/12/12/nearly-half-of-americans-use-digital-voice-assistants-mostly-on-their-smartphones/">Pew Research reports</a> that 46% of Americans now use digital assistants.</p>
<p>Usage is not ubiquitous, but ubiquity is certainly on the horizon. These new voice platforms need new apps to be successful, so major investment has gone into making it easier to develop voicebots. Often these development tools can be used on more than one voice platform with the assumption there will be multiple overlapping ecosystems.</p>
<iframe src="//www.slideshare.net/slideshow/embed_code/key/fFLWra0S1mrGLB?startSlide=26" width="595" height="485" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;" allowfullscreen> </iframe> <div style="margin-bottom:5px"> <strong> <a href="//www.slideshare.net/kleinerperkins/internet-trends-report-2018-99574140" title="Internet Trends Report 2018" target="_blank">Internet Trends Report 2018</a> </strong> from <strong><a href="//www.slideshare.net/kleinerperkins" target="_blank">Kleiner Perkins Caufield &amp; Byers</a></strong> </div>
<h2 id="whyarenttheremorevoicebotsinthecallcenter">Why aren’t there more Voicebots in the Call Center?</h2>
<blockquote>
<p>If voicebot technology is becoming widespread, why do I still need to spend minutes navigating through a typical IVR instead of just saying what I want?</p>
</blockquote>
<p>Even though voicebot technology originally debuted for call centers, the industry overall is well behind what is happening in the consumer market. There are a few reasons for this. One is that integrating voicebot platforms with today's telephony systems is difficult. The platforms generally do not offer their own native telephony interfaces (i.e. SIP &amp; RTP), which means you need to do that integration yourself. Second, there are often data privacy and compliance issues at play. Just like with chatbots, these can be overcome, but the voice channel is new and more complex. There are also a lot of trust issues. Yes, everyone is moving to public cloud infrastructure, but there is a general reluctance to send all your customer interaction data to someone like Amazon when they might be competing against you tomorrow (or already are, with Amazon Connect). Many traditional call center infrastructure providers are well tuned to these needs, but they also lack the millions of users and vast ecosystems that make the voice assistant platforms powerful in the first place.</p>
<h3 id="newtelephonyvoicebotplatformsareemerging">New telephony voicebot platforms are emerging</h3>
<p>Google is not the only one with fully automated voicebots that handle telephone calls. <a href="https://voximplant.com/">VoxImplant</a> is a Communications Platform as a Service (CPaaS) vendor that <a href="https://cogint.ai/making-ivrs-not-suck-with-dialogflow-alexey-aylarov/">introduced a Dialogflow connector</a> a few weeks ago. In addition, <a href="https://www.voca.ai/">Voca.ai</a> is a recently launched startup with a telephone voicebot aimed at financial call centers.</p>
<p>It’s not just start-ups with voicebot platforms for telephony. IBM is in this game today too. Leveraging Watson - IBM’s Jeopardy! champion AI - for use in the call center used to be a lot of work. IBM has made a lot of recent progress in simplifying access to Watson for telephony use cases. In addition to making the development of its bots much simpler, IBM also recently put its Watson telephony gateway in its IBM Cloud service. This effectively makes Watson another CPaaS offering with the ability to connect a SIP channel in minutes.</p>
<h3 id="botslikeoutboundcalls">Bots like Outbound Calls</h3>
<p>Interestingly, as the Google Duplex demo showed, outbound calls tend to be a better fit for the technology in its current state. Unlike an inbound call, where it is hard to tell what the conversation will cover before engaging, the call center dictates the agenda in an outbound calling scenario. Those calls tend to be more specific, making it easier to tune the NLU engine for the scenario.</p>
<h1 id="thevoicechannelisntgoingawaybutitwillbesupplementedbyvoicebots">The Voice Channel Isn’t Going Away, but it will be supplemented by Voicebots</h1>
<p>So, will the introduction of more voicebot technology into the contact center mean agents aren’t needed any more? The data actually indicates just the opposite.</p>
<p>A <a href="https://www2.deloitte.com/us/en/pages/operations/articles/global-contact-center-survey.html">2017 Deloitte</a> study of 450 diverse contact centers showed that customer interactions on the voice channel will fall by 17% by 2019, but still remain significant at 47%, three times more than web chat and email, the next largest channels. More importantly, their respondents expect customer interactions to get more complex and that voice is by far the leader in dealing with complex interactions. Supporting this, the US Bureau of Labor Statistics predicts the larger “Telephone call centers” category will increase in headcount by 27.4% between 2016 and 2026.</p>
<p><img src="https://cogint.ai/content/images/2018/07/bls-telephone-call-center.jpg" alt="When will Voice AI Replace the Call Center Voice Channel?"></p>
<h6 id="theusblstracksemploymentdatabycategorythisisthe2016dataand2026projectionnaics561420telephonecallcenterssourceusbureauoflaborstatics">The US BLS tracks employment data by category. This is the 2016 data and 2026 projection <em>NAICS 561420 - Telephone Call Centers</em>. Source: <a href="https://www.bls.gov/oes/2016/may/naics5_561420.htm">US Bureau of Labor Statics</a></h6>
<p>Self-service, voicebots, and automation are not predicted to reduce overall agent headcount. If anything, it appears this technology will be needed to support increased demand for contact center interactions. Today’s simplistic IVR systems are certainly poised to be replaced by voicebots. However, as this happens, agents will need to handle an increasing load of more complex tasks.</p>
<p>It is early days for this technology, especially in the contact center. Voicebots might not replace agents on the phone, but the technology will certainly be a larger part of the mix in the days to come.</p>
<hr>
<p>Interested in a deeper dive into this and other topics in AI and RTC? Check out our <a href="https://www.krankygeek.com/research/">upcoming report</a>. If you participate in our <a href="https://docs.google.com/forms/d/1PplZrKqsXYbcmJTyt5kvCeyAyqPFAEaeqVWhWSbKvSs/edit">web survey</a> on this topic you get an ebook along with a chance to win 1 of 5 $100 Amazon gift cards.<br>
<img src="https://cogint.ai/content/images/2018/07/ebook-giveaways.png" alt="When will Voice AI Replace the Call Center Voice Channel?"></p>
<hr>
<p>Chad Hart is an analyst and consultant with <a href="https://cwh.consulting">cwh.consulting</a>, a product management, marketing, and strategy advisory helping to advance the communications industry.</p>
<hr>
<h3 id="remembertosubscribeandfollowus">Remember to <a href="https://cogint.ai/subscribe/">Subscribe</a> and <a href="https://twitter.com/cogintai">Follow</a> us.</h3>
</div>]]></content:encoded></item><item><title><![CDATA[Computer Vision Training, the AIY Vision Kit, and Cats]]></title><description><![CDATA[Review of how to do computer vision training followed by specifics on how to implement this for the AIY Vision Kit]]></description><link>https://cogint.ai/custom-vision-training-on-the-aiy-vision-kit/</link><guid isPermaLink="false">5b15c2f204a8c82a4f530cfc</guid><category><![CDATA[vision]]></category><category><![CDATA[guide]]></category><dc:creator><![CDATA[Chad Hart]]></dc:creator><pubDate>Tue, 05 Jun 2018 17:53:24 GMT</pubDate><media:content url="https://cogint.ai/content/images/2018/06/intro-graphic.png" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://cogint.ai/content/images/2018/06/intro-graphic.png" alt="Computer Vision Training, the AIY Vision Kit, and Cats"><p>This is the story of how I built a custom pet detector using computer vision. For those looking at examples of how to apply computer vision, I hope to illustrate a general methodology that should be applicable across a wide range of applications. My particular issue was cats on my kitchen counter, but there are many scenarios where a camera is focused on a particular area and you just want to identify when certain objects appear that don't belong so you can track them.</p>
<p>For anyone that has an AIY Vision Kit, this will also act as a starting guide for how to do custom vision training on your kit until Google does a better job of documenting this. We'll get more technical with some code and specifics around the Vision Kit at the end.</p>
<h2 id="thechallenge">The Challenge 🐈</h2>
<p>Cats. They can be cute, but they can also be evil. Mine have gotten into the habit of jumping on the kitchen counter. They lick food and throw random objects on the ground. I know they don't like the taste of any of this stuff, so it is probably out of spite. The other day one remorselessly murdered my Alexa by dropping my Echo off a 7-foot shelf. Not cute anymore 😒.<br>
<img src="https://cogint.ai/content/images/2018/06/IMG_20180604_140809.jpg" alt="Computer Vision Training, the AIY Vision Kit, and Cats"></p>
<h5 id="fig1kittydidnottakekindlytothemeowskillandpushedalexatoherdeath">Fig 1. Kitty did not take kindly to the <a href="https://www.amazon.com/Zhenhua-Meow/dp/B01GOSN706">Meow!</a> skill and pushed Alexa to her death</h5>
<p>I have <a href="https://webrtchacks.com/javascript-dog-trainer/">fixed pet problems before</a> with modern technology, so I figured I could do the same again using the new AIY Vision Kit I picked up a few weeks ago. To start I just wanted to use Computer Vision to alert me when they jump up where they aren't supposed to be.</p>
<h2 id="firstsometerminology">First, some Terminology</h2>
<p>If you are brand new to Machine Learning and Computer Vision, here are a few terms to keep in mind:</p>
<ul>
<li><strong>Class</strong> - a category used to identify an object in an image. For example, a Dog/Cat/Person model has 3 classes - dog, cat, and person.</li>
<li><strong>Image Classification</strong>  - the process of using machine learning to identify classes in an image</li>
<li><strong>Object Detection</strong> - the process of using machine learning to identify one or more objects in an image and then classify each with a set of coordinates that indicates where the object is located in the image</li>
<li><strong>Network</strong> - the Neural Network architecture used. Building a Neural Network for computer vision is usually reserved for specialized Machine Learning PhDs. Fortunately, there are many high-quality open source networks available. Some of the popular ones include YOLO, Inception, and ResNet. We will be using the <a href="https://arxiv.org/abs/1512.02325">Single Shot Multibox Detector (SSD)</a> with <a href="https://ai.googleblog.com/2017/06/mobilenets-open-source-models-for.html">MobileNets</a> in this project.</li>
<li><strong>Model</strong> - a model is a <em>Network</em>  that has been trained on a specific dataset.</li>
<li><strong>Training</strong> - for our purposes, this is the process of taking a bunch of labeled data, like a picture of a cat with a label file that says &quot;cat&quot; and the cat's coordinates, to build a model</li>
<li><strong>Inference</strong> - running an image through the model to output a result</li>
</ul>
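<p>To make these terms concrete, a single object detection result boils down to a class label, a confidence score, and a bounding box. The dict layout below is just an illustration of the concept (it mirrors the <code>kind/score/bbox</code> output the Vision Kit prints later in this post, not any actual API):</p>

```python
# Illustrative only: one object-detection result represented as a plain dict.
detection = {
    "label": "CAT",              # the class the model assigned
    "score": 0.87,               # model confidence, between 0 and 1
    "bbox": (586, 208, 39, 65),  # (x, y, width, height) in pixels
}

def is_confident(det, wanted_label, threshold=0.5):
    """Typical inference post-processing: keep only confident detections."""
    return det["label"] == wanted_label and det["score"] >= threshold
```

<p>Training is what produces the model that emits results like this; inference is each individual call that produces one.</p>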
<h2 id="aiyvisionkitaselfcontainedvisionsystemthatiscomparativelyeasytouse">AIY Vision Kit - a self-contained vision system that is (comparatively) easy to use</h2>
<p>If you are not familiar, the <a href="https://aiyprojects.withgoogle.com/vision/">AIY Vision Kit</a> is a $90 &quot;do-it-yourself intelligent camera&quot; from Google that you can get from Target (in the US at least). Aimed at developers and STEM educators, the Vision Kit is housed in a self-assembled cardboard box. The hardware includes a Raspberry Pi Zero and a Google-designed add-on board called the <em>Vision Bonnet</em>. The Bonnet features a special Intel-made processor made for handling the kinds of neural networks modern computer vision algorithms run on. For more on the Vision Kit, see the <a href="https://webrtchacks.com/aiy-vision-kit-tensorflow-uv4l-webrtc/">post I did on webrtcHacks</a> covering the original version.<br>
<img src="https://cogint.ai/content/images/2018/06/aiy-vision-kit-unboxing.png" alt="Computer Vision Training, the AIY Vision Kit, and Cats"></p>
<h5 id="fig2unboxingtheaiyvisionkit">Fig 2. Unboxing the AIY Vision Kit</h5>
<h3 id="cloudvsedgeai">Cloud vs. Edge AI</h3>
<p>Like nearly everything else in AI, it is pretty common to run Computer Vision in the cloud. Microsoft, Amazon, Google, IBM, and others all have cloud APIs you can pay for. As I showed in <a href="https://webrtchacks.com/webrtc-cv-tensorflow/">another webrtcHacks post</a>, it is also not too difficult to set up your own service using open source.</p>
<p>The cloud is easy, but is not always best for a few reasons:</p>
<ul>
<li>Privacy - not everyone wants to send a live video feed to some remote destination</li>
<li>Latency - it takes time to send an image up, have it processed, and then get the results back</li>
<li>Costs - processing images (aka &quot;running inference&quot;) in the cloud isn't free</li>
</ul>
<p>The Vision Kit does what is known as &quot;Edge AI&quot;. Inference actually runs on device and no Internet connection is required. In my case this means:</p>
<ul>
<li>I don't have to worry about sending images/videos of my house to the wrong API account or who knows who else</li>
<li>I get instant inference, which is helpful when you want your application to respond within a few tens of milliseconds instead of seconds</li>
<li>I can process as many images as I want without worrying about how much it will cost - in fact, it can continuously process a 30 frames per second video stream</li>
</ul>
<p>While this post will focus on the AIY Vision Kit, most of the concepts will apply to other embedded AI devices. You can actually buy a <a href="https://developer.movidius.com/">different kit</a> from Intel that includes the same chip, just on a USB stick. Microsoft recently <a href="https://azure.microsoft.com/en-us/blog/accelerating-ai-on-the-intelligent-edge-microsoft-and-qualcomm-create-vision-ai-developer-kit/">announced</a> something similar with a new Qualcomm chip called the Snapdragon™ Neural Processing Engine (NPE).</p>
<h3 id="includedcomputervisionmodels">Included Computer Vision Models</h3>
<p>The kit comes with several <a href="https://aiyprojects.withgoogle.com/models/">models</a>. Most are image classifiers:<br>
<img src="https://cogint.ai/content/images/2018/06/aiy-models.png" alt="Computer Vision Training, the AIY Vision Kit, and Cats"></p>
<h5 id="figure3thecurrentmodelsavailablefortheaiyvisionkitsourceslideshare">Figure 3. The current models available for the AIY Vision Kit. Source: <a href="https://www.slideshare.net/ChadHart/aiy-vision-kit-embedded-ml-for-stem-and-makers-gdg-boston-tensorflow">SlideShare</a></h5>
<p>The AIY Kit also includes a &quot;Dog / Cat / Human&quot; object detector. Since I am looking to detect cats, one would think this would be a quick project. Unfortunately, this built-in object detector model did not do a great job of detecting my black cats. Much of the time it would not classify them at all. Other times I had misclassifications like this one, where it mistook a mug full of markers for a person and totally missed the cat:<br>
<img src="https://cogint.ai/content/images/2018/06/2-after_detection.jpg" alt="Computer Vision Training, the AIY Vision Kit, and Cats"></p>
<h5 id="fig4aiyvisionkitdogcathumanannotatedimageafterobjectdetectionnotetheareahighlightedwiththisresultobject0kindperson1score0567637bbox5862083965">Fig 4. AIY Vision Kit Dog/Cat/Human annotated image after object detection - note the area highlighted with this result <code>Object #0: kind=PERSON(1), score=0.567637, bbox=(586, 208, 39, 65)</code></h5>
<h3 id="whydoesntitwork">Why doesn't it work?</h3>
<p>Google has not published how they built this particular model, but it is likely they used pet data like what is found in <a href="http://www.robots.ox.ac.uk/~vgg/data/pets/">The Oxford-IIIT Pet Dataset</a> for their pet images when they trained the model. Looking at the pictures of <a href="http://www.robots.ox.ac.uk/~vgg/data/pets/getCategory.php?category=Bombay">some of the black cats</a>, nearly all of them are on the ground with good lighting. None of them are on a kitchen counter. Perhaps the model just did not &quot;see&quot; enough examples that closely matched my kitchen. I also suspect black cats, particularly at a distance, are hard because there are not a lot of easily identifiable features when you have low-contrast lighting.</p>
<p>Since deep neural networks are somewhat of a black box, it is difficult to tell what is happening. However, this does not mean we need to give up.</p>
<h2 id="makeitworkwithtensorflowandcustomtraining">Make it work with Tensorflow and Custom Training</h2>
<p>The AIY Vision Kit, like all AI things Google, is based on Tensorflow, the <a href="https://cogint.ai/ml-github-comparison/">most popular</a> Machine Learning framework on the planet today. With some work, you can load your own models onto the Vision Kit to classify and/or detect whatever you want. Let's give this a try...<br>
<img src="https://cogint.ai/content/images/2018/06/high-level-process.png" alt="Computer Vision Training, the AIY Vision Kit, and Cats"></p>
<h5 id="fig5highlevelprocessforcustomtrainingontheaiyvisionkit">Fig 5. High-level process for custom training on the AIY Vision Kit</h5>
<h3 id="processoverview">Process overview</h3>
<p>We will be using Tensorflow's <a href="https://github.com/tensorflow/models/tree/master/research/object_detection">Object Detection API</a> to train our own model. The steps are not for the faint of heart. This will take you hours, even after you get the hang of it. Loading the model on the Vision Kit adds a few extra steps.</p>
<p>The whole process goes something like this:</p>
<ol>
<li>Get a lot of images covering the classes you care about covering - a couple hundred for each class, ideally covering a variety of angles, backgrounds, and lighting conditions you will encounter in the real world</li>
<li>Label each image, by hand</li>
<li>Prepare the dataset for training - convert the labels to the appropriate format, make a label map file, and separate the images into a large training set and a smaller evaluation set</li>
<li>Set the right parameters inside the object detection configuration file</li>
<li>Prepare your Tensorflow environment - this can be another list of items if you are training in the Cloud like I did with ML-Engine</li>
<li>Run the training to produce a graph</li>
<li>Freeze the graph</li>
<li>Compile the graph using the Bonnet Compiler (in Ubuntu)</li>
<li>Load the compiled graph onto the Kit</li>
<li>Tweak the example code to make it work with the compiled graph</li>
<li>Fire up your code and hope it all works!</li>
</ol>
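<p>As a concrete example of one of the artifacts from step 3, the Object Detection API's label map is a small protobuf text file mapping each class name to an id (counting from 1). For my two classes it would look like this (the file name is up to you):</p>

```
item {
  id: 1
  name: 'cat'
}
item {
  id: 2
  name: 'person'
}
```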
<p>This process is generic for the Object Detection API up to step 7; the remaining steps are specific to the Vision Kit. The specifics of the process depend on your environment. I recommend following along with some guides to do this. Two that helped me the most are:</p>
<ul>
<li><a href="https://pythonprogramming.net/video-tensorflow-object-detection-api-tutorial/">Streaming Object Detection Video - Tensorflow Object Detection API Tutorial</a></li>
<li><a href="https://www.youtube.com/watch?v=Rgpfk6eYxJA">How To Train an Object Detection Classifier Using TensorFlow 1.5 (GPU) on Windows 10</a></li>
</ul>
<p>I will give some highlights on these steps below.</p>
<h1 id="tipsforaiyers">Tips for AIY'ers</h1>
<p>It was a long saga to get this all working on the Vision Kit. If you want to see the journey, with valuable commentary from Googlers and community contributions, check out this <a href="https://github.com/google/aiyprojects-raspbian/issues/314#issuecomment-389317428">GitHub issue</a>.</p>
<h3 id="howtogenerateabunchofimagesquickly">How to generate a bunch of images quickly</h3>
<p>The first challenge is getting a bunch of images. I was able to use the <a href="https://github.com/Motion-Project/motion">Motion Project</a> to generate a bunch of images quickly. <em>Motion</em> is a Linux program that can run on the Pi. It looks for changes in a video camera image between frames and can save a snapshot. I set up this program to take snapshots whenever motion was detected in the camera's field of view. I did this a bunch of times over a couple of days, turning the camera to get different angles, raising its height, making sure the counter and kitchen table had different types of objects on them, and capturing a variety of lighting conditions.</p>
<p>My cats are very naughty, so it didn't take long for me to get a few hundred images. I also wanted to make sure I could classify people, so I let it run with people around too. In retrospect, I wish I had taken more people images since there is a lot more variability in people vs. my cats, who don't change outfits.</p>
<p>Motion includes a ton of parameters. One of my first mistakes was taking too many pictures - I ended up with a lot of pictures that were basically the same. Remember, you need to manually label each of these pictures. I found that including more images without a lot of variety just added to the workload without changing the accuracy much. I should have put a longer delay between snapshots.</p>
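<p>For reference, the relevant knobs live in <code>motion.conf</code>. A fragment along these lines would have saved me a lot of labeling (parameter names are from the Motion documentation; exact names and defaults vary a little between Motion versions):</p>

```
# motion.conf fragment - throttle snapshots so you label variety, not duplicates
framerate 2              # only examine a couple of frames per second
threshold 3000           # number of changed pixels that counts as motion
minimum_motion_frames 3  # require sustained motion before triggering an event
event_gap 30             # seconds of quiet before an event ends
output_pictures best     # save one representative picture per event, not every frame
target_dir /home/pi/snapshots
```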
<h3 id="settingupyourpipelineconfigurationfile">Setting up your pipeline configuration file</h3>
<p>For object detection on the AIY Vision Kit, you need to start with this configuration file: <a href="https://github.com/tensorflow/models/blob/master/research/object_detection/samples/configs/embedded_ssd_mobilenet_v1_coco.config">embedded_ssd_mobilenet_v1_coco.config</a></p>
<p>Then change the following lines:</p>
<ul>
<li><code>num_classes</code> - set this to the number of classes (counting from 1) for your custom model - in my case this was 2 (cat &amp; person)</li>
<li><code>fine_tune_checkpoint</code> - used if you are going to continue training from a saved checkpoint. Leave it out if you are going to train from scratch. More on this in a bit</li>
<li><code>input_path</code> - path to your training <code>record</code> file</li>
<li><code>label_map_path</code> - where you saved your label map file</li>
</ul>
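<p>Putting those edits together, the relevant portions of my copy of the config looked roughly like this (the bucket paths are placeholders for your own, and the <code>fine_tune_checkpoint</code> line is omitted if you train from scratch):</p>

```
model {
  ssd {
    num_classes: 2
    # ... rest of embedded_ssd_mobilenet_v1_coco.config unchanged
  }
}
train_config: {
  fine_tune_checkpoint: "gs://{{my-gcp-bucket}}/checkpoints/model.ckpt"
  # ...
}
train_input_reader: {
  tf_record_input_reader {
    input_path: "gs://{{my-gcp-bucket}}/data/train.record"
  }
  label_map_path: "gs://{{my-gcp-bucket}}/data/label_map.pbtxt"
}
```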
<h3 id="trainingongooglecloud">Training on Google Cloud</h3>
<p>I originally tried to do some training on my MacBook Pro. You usually need to run tens of thousands of training steps - sometimes up to 200,000. With no usable GPU for training on my Mac, a step could take up to 5 seconds. Doing the math, 100,000 steps would take me many days, not to mention nearly melting my machine from running at full tilt all that time.</p>
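<p>The back-of-the-envelope math that convinced me to give up on CPU training (5 seconds per step is the worst case I observed):</p>

```python
# Rough CPU training-time estimate
steps = 100_000
seconds_per_step = 5                      # observed worst case on a laptop CPU
days = steps * seconds_per_step / 86_400  # 86,400 seconds in a day
print(round(days, 1))                     # roughly 5.8 days of continuous training
```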
<p>So I then decided to move my training to Google Cloud's ML-Engine, mostly because that seems to be the <a href="https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_on_cloud.md">best documented</a> as part of the Object Detection API repo.</p>
<p>This involves a bunch of extra steps and some fiddling to make it work. When I first tried it, it would run for a while but would eventually fail due to memory issues (which I addressed by adjusting the batch size). Other times it would fail a few minutes in (I never fully figured out why). When this process is spread over a handful of machines and you are doing it a few times, you can quickly burn through your latte budget for the day without doing any real training.</p>
<p>In the end I found it simplest to just run on a single machine with the following:</p>
<pre><code>gcloud ml-engine jobs submit training object_detection_eval_`date +%s` \
    --job-dir=gs://{{my-gcp-bucket}}/train \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
    --module-name object_detection.eval \
    --region us-east1 --scale-tier BASIC_GPU --python-version 3.5 \
    --runtime-version 1.6 \
    -- \
    --checkpoint_dir=gs://{{my-gcp-bucket}}/train \
    --eval_dir=gs://{{my-gcp-bucket}}/test \
    --pipeline_config_path=gs://{{my-gcp-bucket}}/data/ssd_mobilenet_v1_cat.config
</code></pre>
<p>I originally intended to let this run for a while to verify it works before switching to running it on multiple machines. Then, once it was working I was hesitant to mess with it (and I had other things to do), so I just let it run in the background.</p>
<p>My last run ran for 100,000 steps consuming 10.67 ML-units for a total cost of $5.30.</p>
<h3 id="runningthebonnetcompiler">Running the Bonnet Compiler</h3>
<p>You need to compile your graph to run on the Vision Bonnet. After you <a href="https://dl.google.com/dl/aiyprojects/vision/bonnet_model_compiler_latest.tgz">download</a> the Bonnet Compiler, you'll need to find a place to run it. The Vision Kit page says:</p>
<blockquote>
<p>The compiler works only with x86 64 CPU running Linux. It was tested with Ubuntu 14.04. You should NOT run it on VisionBonnet.</p>
</blockquote>
<p>This is a pain for those running OS X (me) or Windows. Fortunately, I already had an Ubuntu instance set up in VirtualBox. Installing and configuring VirtualBox takes a bunch more steps. If you do this, make sure you use the VirtualBox Guest Additions package to make it easy to pass files back and forth between the host machine and the VM.</p>
<p>I am running Ubuntu 16.04 and it works fine for me.</p>
<p>If I were starting from scratch and just needed a quick Ubuntu environment to run the compiler, I would probably just use Docker. Now that I think about it, I wish I had tried that.</p>
<h3 id="compilercommand">Compiler command</h3>
<p>Here is the compiler command I used:</p>
<pre><code>./bonnet_model_compiler.par \
    --frozen_graph_path=aiy_cat_detector.pb \
    --output_graph_path=cat_detector.binaryproto \
    --input_tensor_name=&quot;Preprocessor/sub&quot; \
    --output_tensor_names=&quot;concat,concat_1&quot; \
    --input_tensor_size=256
</code></pre>
<p>Note the output tensor names. This part was not obvious, and one of the <a href="https://github.com/weiranzhao">Googlers</a> pointed it out to me.</p>
<h3 id="aiyobjectdetectioncodechanges">AIY Object Detection code changes</h3>
<p>If that all works, you now need to make some tweaks to the AIY Vision Kit object detection samples.</p>
<h4 id="changetheobjectdetectionhelperlibrary">Change the Object Detection helper library</h4>
<p>Make a copy of <a href="https://github.com/google/aiyprojects-raspbian/blob/aiyprojects/src/aiy/vision/models/object_detection.py">https://github.com/google/aiyprojects-raspbian/blob/aiyprojects/src/aiy/vision/models/object_detection.py</a> and modify 3 lines:</p>
<ul>
<li><a href="https://github.com/google/aiyprojects-raspbian/blob/aiyprojects/src/aiy/vision/models/object_detection.py#L29">Line 29</a> to modify the labels to match what you setup in your training</li>
<li><a href="https://github.com/google/aiyprojects-raspbian/blob/aiyprojects/src/aiy/vision/models/object_detection.py#L75">Line 75</a> to change the assertion here from <code>4</code> to match your number of labels</li>
<li><a href="https://github.com/google/aiyprojects-raspbian/blob/aiyprojects/src/aiy/vision/models/object_detection.py#L84">Line 84</a> - change the index here from <code>4</code> to match your number of labels</li>
</ul>
<h4 id="writeyourprogram">Write your program</h4>
<h5 id="imageinference">Image Inference</h5>
<p>I started out by modifying <a href="https://github.com/google/aiyprojects-raspbian/blob/aiyprojects/src/examples/vision/object_detection.py">object_detection.py</a> for initial testing. All you need to do is change <code>from aiy.vision.models import object_detection</code> to import your modified <code>object_detection.py</code> helper file instead, like this:</p>
<pre><code>from aiy.vision.inference import ImageInference
# from aiy.vision.models import object_detection
# Use my modified file instead
import aiy_cat_detection
</code></pre>
<h5 id="camerainference">Camera Inference</h5>
<p>Lastly, I wrote/modified a program to run <code>CameraInference</code>. I added a couple other bells &amp; whistles, like the privacy LED indicator and some sounds.</p>
<p>The full thing is just:</p>
<pre><code>import argparse

from picamera import PiCamera
from time import time, strftime


from aiy.vision.leds import Leds
from aiy.vision.leds import PrivacyLed
from aiy.toneplayer import TonePlayer

from aiy.vision.inference import CameraInference
import aiy_cat_detection

# Sound setup
MODEL_LOAD_SOUND = ('C6w', 'c6w', 'C6w')
BEEP_SOUND = ('E6q', 'C6q')
player = TonePlayer(gpio=22, bpm=30)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--num_frames',
        '-f',
        type=int,
        dest='num_frames',
        default=-1,
        help='Sets the number of frames to run for, otherwise runs forever.')

    parser.add_argument(
        '--num_pics',
        '-p',
        type=int,
        dest='num_pics',
        default=-1,
        help='Sets the max number of pictures to take, otherwise runs forever.')

    args = parser.parse_args()

    with PiCamera() as camera, PrivacyLed(Leds()):
        # See the Raspicam documentation for mode and framerate limits:
        # https://picamera.readthedocs.io/en/release-1.13/fov.html#sensor-modes
        # Set to the highest resolution possible at 16:9 aspect ratio
        camera.sensor_mode = 5
        camera.resolution = (1640, 922)
        camera.start_preview(fullscreen=True)

        with CameraInference(aiy_cat_detection.model()) as inference:
            print(&quot;Camera inference started&quot;)
            player.play(*MODEL_LOAD_SOUND)

            last_time = time()
            pics = 0
            save_pic = False

            for f, result in enumerate(inference.run()):

                for i, obj in enumerate(aiy_cat_detection.get_objects(result, 0.3)):

                    print('%s Object #%d: %s' % (strftime(&quot;%Y-%m-%d-%H:%M:%S&quot;), i, str(obj)))
                    x, y, width, height = obj.bounding_box
                    if obj.label == 'CAT':
                        save_pic = True
                        player.play(*BEEP_SOUND)

                # save the image if there was 1 or more cats detected
                if save_pic:
                    # save the clean image
                    camera.capture(&quot;images/image_%s.jpg&quot; % strftime(&quot;%Y%m%d-%H%M%S&quot;))
                    pics += 1
                    save_pic = False

                if f == args.num_frames or pics == args.num_pics:
                    break

                now = time()
                duration = (now - last_time)

                # The Movidius chip runs at 35 ms per image.
                # Then there is some additional overhead for the object detector to
                # interpret the result and to save the image. If the total process time is
                # running slower than 500 ms it could be a sign the CPU is getting overrun
                if duration &gt; 0.50:
                    print(&quot;Total process time: %s seconds. Bonnet inference time: %s ms &quot; %
                          (duration, result.duration_ms))

                last_time = now

        camera.stop_preview()

if __name__ == '__main__':
    main()
</code></pre>
<p>All my code and model is <a href="https://github.com/chadwallacehart/aiy_custom_cat_detector">available on GitHub</a> for reference.</p>
<h3 id="startingfromscratchvstransferlearning">Starting from Scratch vs. Transfer learning</h3>
<p>Transfer learning is the idea of taking an existing model and quickly retraining it on a new set of classes. Rather than starting from scratch, you can just remove the last few layers of the neural network and then perform a relatively small number of training steps to produce a new model. The <code>fine_tune_checkpoint</code> above lets you do this.</p>
<p>Unfortunately, Google did not publish their model other than in its compiled binary form. Tensorflow does <a href="https://github.com/tensorflow/models/tree/master/research/slim#Pretrained">publish a number of models</a> based on various sets that are useful for transfer learning. However, the particular MobileNet SSD configuration that the Vision Bonnet requires - a <code>256</code>x<code>256</code> input image with depthwise multiplier of <code>0.125</code> - is not one of them.</p>
<p>With no existing model to work from, I tried to train my model from scratch. This did not work well. It would sometimes think my oven was a person. I suspect my input images did not have enough variety for the model to accurately distinguish between objects - it never really learned what features make up an object.</p>
<p>Then <a href="https://github.com/zhoujustin">zhoujustin</a> saved the day and <a href="https://drive.google.com/file/d/1_MeZ8kvmpNibPZvSJGnwKNRATeuyxNtu/view?usp=sharing">shared a trained model</a> based on the 20-class <a href="http://host.robots.ox.ac.uk/pascal/VOC/">VOC dataset</a>. I set this as the <code>fine_tune_checkpoint</code> and my results have been pretty good.</p>
<h2 id="theresults">The results</h2>
<p>After struggling to get something to work, my new custom trained model works great!<br>
Here is an example with very low light:<br>
<img src="https://cogint.ai/content/images/2018/06/6-after_detection.jpg" alt="Computer Vision Training, the AIY Vision Kit, and Cats"></p>
<p>And here is one where my cat seems to be more attracted to the constant beeping my kit is making instead of running away from it:<br>
<img src="https://cogint.ai/content/images/2018/06/7-after_detection.jpg" alt="Computer Vision Training, the AIY Vision Kit, and Cats"></p>
<p>I have some work to do here on the deterrent part, but at least now I have accurate detection. Next step is to get my robot involved. The bounding box coordinates will come in handy for aiming projectiles...</p>
<h2 id="conclusions">Conclusions</h2>
<p>Ideally the built-in AIY cat detector model would have worked for me out of the box, but it is really wishful thinking to expect it to work well in all circumstances. It is exciting to see that you can get high-accuracy results through custom training - even when inference is running on a low-power device.</p>
<p>It took several months to make this work, but we got there eventually. Thanks to everyone on <a href="https://github.com/google/aiyprojects-raspbian/issues/314#issuecomment-389317428">GitHub issue 314</a> for helping!</p>
<p>Now I need to get back to shooing my cats off the counter...</p>
<hr>
<p><a href="https://chadwallacehart.com/bio">Chad Hart</a> is an analyst and consultant with <a href="https://cwh.consulting">cwh.consulting</a>, a product management, marketing, and strategy advisory helping to advance the communications industry. He is currently working on an <a href="https://www.krankygeek.com/research">AI in RTC report</a>.</p>
<hr>
<h3 id="remembertosubscribeandfollowus">Remember to <a href="https://cogint.ai/subscribe/">Subscribe</a> and <a href="https://twitter.com/cogintai">Follow</a> us.</h3>
</div>]]></content:encoded></item><item><title><![CDATA[Making IVRs not Suck with Dialogflow (Alexey Aylarov)]]></title><description><![CDATA[Alexey Aylarov of VoxImplant introduces Google's Dialogflow and reviews how they integrated it into their telephony system to provide an alternative to IVRs.]]></description><link>https://cogint.ai/making-ivrs-not-suck-with-dialogflow-alexey-aylarov/</link><guid isPermaLink="false">5b0c839b04a8c82a4f530ceb</guid><category><![CDATA[voicebot]]></category><category><![CDATA[dialogflow]]></category><dc:creator><![CDATA[Alexey Aylarov]]></dc:creator><pubDate>Tue, 29 May 2018 12:30:53 GMT</pubDate><media:content url="https://cogint.ai/content/images/2018/05/shutterstock_259628300.jpg" medium="image"/><content:encoded><![CDATA[<div class="kg-card-markdown"><img src="https://cogint.ai/content/images/2018/05/shutterstock_259628300.jpg" alt="Making IVRs not Suck with Dialogflow (Alexey Aylarov)"><p>We are all getting used to asking Siri, Alexa, and Google to find things via voice. Alexa has <a href="https://voicebot.ai/2018/03/22/amazon-alexa-skill-count-surpasses-30000-u-s/">more than 30,000 skills</a> and smart speakers <a href="https://www.canalys.com/newsroom/google-beats-amazon-to-first-place-in-smart-speaker-market?campaignname=googlechirp">grew by 210% in Q1</a>. Clearly voice assistants have matured, which is why it is insane that businesses still make their customers sit through tedious <a href="https://en.wikipedia.org/wiki/Interactive_voice_response">Interactive Voice Response (IVR)</a> systems that rely on archaic touch-tone input and multi-layer menus. Sure, some high-end enterprises might have better systems, but, by and large, the IVR experience has not changed much in decades.</p>
<p>I believe this is about to change. I was <a href="https://cogint.ai/google-uses-ai-to-make-phone-calls-interesting-again/">commenting on Google's recent Duplex demo</a> and Alexey Aylarov of <a href="https://voximplant.com">VoxImplant</a> mentioned they were just wrapping up an integration with Dialogflow - the conversational bot company formerly known as api.ai that Google bought and rebranded.</p>
<p>VoxImplant is a Communications Platform as a Service (CPaaS) with most of its team in Moscow. Alexey claimed they could do something similar to Duplex using public APIs (though without many of Google’s more advanced tricks). He has a good <a href="https://webrtchacks.com/?s=alexey">history of sharing</a> the workings behind new concept demos, so I asked if he would share his experience and learnings from integrating with Dialogflow here. Please see below for Alexey's high-level review of Dialogflow, more details on how VoxImplant went about the telephony integration, and, for the more technical, some demos and code at the end.</p>
<p><a href="https://cogint.ai/author/chad/">Chad Hart</a>, Editor</p>
<hr>
<h1 id="makingivrsnotsuckwithdialogflow">Making IVRs not Suck with Dialogflow</h1>
<p>In most cases when we need to communicate with a business over the phone today we have to deal with a good old IVR. “Press 1 to …, press 2 to …, press 3 to …”. Just the initial options prompt itself can last for a few minutes. After you endure this first level of IVR menu comes a 2nd level, which can also lead to the 3rd one, and so on. It can be maddening. I haven’t met anyone who thinks that kind of customer experience is great. Fortunately alternatives to the IVR menu are becoming practical for a large audience. Thanks to the progress in machine learning and IT in general, speech recognition and voice bots have started to become accessible for business.</p>
<p>Rather than navigating through IVR menus, wouldn’t it be better if you could just say what you need or want, like you would with a real operator? That is totally possible with today’s natural language processing (NLP) and speech recognition technologies. Let's explore how we did this in our platform using Dialogflow.</p>
<h2 id="dialogflow">Dialogflow</h2>
<p>Google’s Dialogflow lets people create intelligent bots that understand natural language and can handle conversations like a live person. Dialogflow supports many nice features out of the box that are a good fit for common IVR interactions, such as:</p>
<ul>
<li><em>Understanding intents</em> - mapping a variety of spoken phrases to a defined action</li>
<li><em>Slot filling</em> - handling multiple data inputs per user utterance and knowing to keep asking when more information is required</li>
<li><em>Fallbacks</em> - handling errors and avoiding loops if something is not understood</li>
<li><em>API and webhooks</em> - for integration with external web services</li>
<li><em>Speech recognition</em> - the new V2 API can directly handle speech input from Google’s Cloud Speech API</li>
</ul>
<p>Dialogflow is a powerful tool that developers can quickly become familiar with after reading the docs and checking some examples. It makes it easy to build rather complex bots without programming (or at least without serious programming). If you are familiar with modern bot programming, Dialogflow works with a number of common bot abstractions you need to set up for your agent: Intents, Parameters, Contexts, Entities, and so on.</p>
<p>Choosing Dialogflow as an integration option was an easy decision for us. We already had a Google Speech integration (see below), so there were minimal technical challenges on our end. We also found it works well for our audience, and their documentation and examples are great, so we do not think our users will have much trouble getting started with the API.</p>
<p>There are already a lot of articles and posts about Dialogflow and how to build various kinds of bots. Rather than rehash these guides, instead we will focus the rest of this post on how to integrate Dialogflow into a telephony system.</p>
<p><img src="https://cogint.ai/content/images/2018/05/VoxImplant-Dialogflow-high-level-architecture-high-res.png" alt="Making IVRs not Suck with Dialogflow (Alexey Aylarov)"></p>
<h2 id="telephonyconsiderations">Telephony Considerations</h2>
<h3 id="transcriptionintegration">Transcription Integration</h3>
<p>Before API V2, Dialogflow agents could only work with text. That meant you had to handle converting any speech to text yourself. Then you could pass that text to the API and receive a response from an agent after the NLU/NLP job was done. The new API V2 adds a <a href="https://dialogflow.com/docs/reference/api-v2/rpc/google.cloud.dialogflow.v2#google.cloud.dialogflow.v2.StreamingDetectIntentRequest">StreamingDetectIntentRequest</a> method that lets you send audio directly to a Dialogflow agent. This method will transcribe the speech and then automatically process the transcription to create a response. It turns out this new method works basically the same way as the Google Cloud Speech API.</p>
<p>You can still handle your own speech recognition, but it is really easier to use Google’s methods. Google’s Cloud Speech API allows uploading a recording, but you really should stream the audio to reduce latency and prevent awkward silences while audio is processed. Google uses the gRPC protocol for streaming your audio data in real time to Dialogflow. Unfortunately, I do not think there is any magic method for converting a live audio stream into <a href="https://opensource.google.com/projects/grpc">gRPC</a>. Ultimately you need some kind of converter that takes the stream from your system, chunks it up, and then sends it over gRPC. We had already integrated Google’s Cloud Speech API into Voximplant for automatic speech recognition with our own library, so it was relatively easy for us to add Dialogflow’s streaming API support.</p>
<h3 id="codecs">Codecs</h3>
<p>One of the first areas you will need to align on is the audio encoding. <a href="https://dialogflow.com/docs/reference/api-v2/rpc/google.cloud.dialogflow.v2#google.cloud.dialogflow.v2.AudioEncoding">Google supports</a> <code>Linear PCM</code>, <code>FLAC</code>, <code>G.711 PCMU</code>, <code>AMR/AMR-WB</code>, <code>OPUS</code> in an OGG container, and <code>Speex Wideband</code>. Your telephony system definitely supports one of these, so it is just a matter of matching the codec to whatever the caller is using. We do a lot of <a href="https://en.wikipedia.org/wiki/Public_switched_telephone_network">PSTN</a> audio processing in our backend, so we usually use PCM. One could use an encoding with more compression to reduce bandwidth consumption, but it is important to remember that additional transcoding/audio processing significantly affects recognition accuracy. Compressing the audio stream only makes sense if you are already receiving the encoded audio in a format supported by Google’s backend.</p>
<h3 id="speechdurationmodes">Speech Duration Modes</h3>
<p>Dialogflow/Google Speech supports two speech recognition modes:</p>
<ol>
<li><em>Single utterance</em> - recognition stops as soon as Dialogflow decides that the user has stopped talking and processes the sentence using NLU/NLP. (<code>singleUtterance = true</code>)</li>
<li><em>Long running</em> - you send speech and recognition continues until it is stopped via the API, so everything is transcribed and sent through the NLU/NLP engine. (<code>singleUtterance = false</code>)</li>
</ol>
<p>The long running mode also provides interim recognition results that arrive in <a href="https://dialogflow.com/docs/reference/api-v2/rpc/google.cloud.dialogflow.v2#streamingrecognitionresult">StreamingRecognitionResult</a> objects. These can be used by the NLU engine to predict an intent before the speech even stops, much as a human listener would.</p>
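A consumer of the streaming responses has to separate those interim transcripts from the final NLU result. The sketch below shows one way to do that over a batch of already-received messages; the response shapes are simplified mocks of the v2 <code>StreamingDetectIntentResponse</code> (interim text in <code>recognitionResult</code>, the NLU output in <code>queryResult</code>), and in a real integration you would handle each message as it arrives on the stream.

```javascript
// Sketch: split interim transcripts from the final query result.
function handleStreamingResponses(responses) {
  const interim = [];
  let finalResult = null;
  for (const resp of responses) {
    if (resp.recognitionResult && !resp.recognitionResult.isFinal) {
      // Interim transcript: could be used to predict an intent early,
      // e.g. to pre-fetch data before the caller finishes speaking.
      interim.push(resp.recognitionResult.transcript);
    } else if (resp.queryResult) {
      // The agent's NLU/NLP output arrives once, at the end.
      finalResult = resp.queryResult;
    }
  }
  return { interim, finalResult };
}
```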
<h3 id="returnedresult">Returned Result</h3>
<p>After NLU/NLP is done, the Dialogflow agent returns a <a href="https://dialogflow.com/docs/reference/api-v2/rpc/google.cloud.dialogflow.v2#google.cloud.dialogflow.v2.QueryResult">QueryResult</a> object populated with some data - action, intent, contexts, etc. (see the next section for a code example). The telephony backend can use this, for example, to say something with a Text-to-Speech (TTS) engine, change the IVR branch, or connect the caller with a live person. We had an existing Text-to-Speech capability for the many languages we support, but recently added Google’s Text-to-Speech with its WaveNet-powered voices. The WaveNet TTS engine sounds very realistic, but unfortunately it is only available for US English today. We do expect new languages to appear in the not-too-distant future. I also expect that Dialogflow will support sending audio generated by Google’s TTS at some point too.</p>
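A minimal sketch of that routing decision might look like the following. The field names (<code>fulfillmentText</code>, <code>intent.displayName</code>) come from the v2 <code>QueryResult</code>, but the action names and the <code>escalate</code> intent are hypothetical, chosen just to illustrate the three branches: speak a reply, hand off to a person, or reprompt.

```javascript
// Sketch: map a Dialogflow QueryResult onto a telephony action.
function routeQueryResult(queryResult) {
  if (queryResult.intent && queryResult.intent.displayName === 'escalate') {
    // Caller matched a hypothetical "talk to a human" intent.
    return { action: 'connectAgent' };
  }
  if (queryResult.fulfillmentText) {
    // Agent produced a reply: speak it back with TTS.
    return { action: 'say', text: queryResult.fulfillmentText };
  }
  // Nothing usable came back: ask the caller again.
  return { action: 'reprompt' };
}
```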
<h2 id="democode">Demo &amp; Code</h2>
<p>If you want to try it in action you can play with our pizza order &amp; delivery phone bot <a href="https://demos05.voximplant.com/dialogflow-connector/">here</a>:</p>
<p><a href="https://demos05.voximplant.com/dialogflow-connector/"><img src="https://cogint.ai/content/images/2018/05/pizza-order-demo-1.png" alt="Making IVRs not Suck with Dialogflow (Alexey Aylarov)"></a></p>
<p>Dialogflow allows you to import a project, which makes it easy to get started by modifying an existing application through the GUI. You can grab the Dialogflow agent used in this demo here: <a href="https://demos05.voximplant.com/dialogflow-connector/PizzaOrderDelivery.zip">PizzaOrderDelivery.zip</a>.</p>
<p>You can see from the demo that some of the interpreted information is relayed back to the webpage as a visual confirmation. To do this we set up a NodeJS proxy in our backend that sends the interpreted responses from Dialogflow to the local browser. We plan to add websocket support to our API engine to make this easier.</p>
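The proxy only needs to forward a small slice of the Dialogflow response for the page to display. As a sketch (the exact fields forwarded here are an assumption, not Voximplant’s actual proxy), a hypothetical helper might serialize just the display-relevant parts of the <code>QueryResult</code> before pushing them to the browser:

```javascript
// Sketch: strip a QueryResult down to the fields the webpage displays,
// so the proxy forwards a small JSON message instead of the full response.
function toBrowserMessage(queryResult) {
  return JSON.stringify({
    queryText: queryResult.queryText,               // what the caller said
    fulfillmentText: queryResult.fulfillmentText,   // what the bot replied
    intent: queryResult.intent ? queryResult.intent.displayName : null,
    parameters: queryResult.parameters || {},       // e.g. pizza size/toppings
  });
}
```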
<p>On our side, we use something like the cloud function below to process the returned <code>QueryResult</code> object and respond with Text-to-Speech:</p>
<pre><code>    function sendMediaToDialogflow() {
      // TTS playback is done - stop listening for it and resume
      // streaming the caller's audio to the Dialogflow agent
      call.removeEventListener(CallEvents.PlaybackFinished)
      call.sendMediaTo(dialogflow)
    }

    function onDialogflowQueryResult(e) {
      // Speak the agent's reply, then go back to listening to the
      // caller once the TTS playback finishes
      if (e.result.fulfillmentText !== undefined) {
        call.say(e.result.fulfillmentText, Language.Premium.US_ENGLISH_FEMALE)
        call.addEventListener(CallEvents.PlaybackFinished, sendMediaToDialogflow)
      }
    }

</code></pre>
<p>Many of the function calls are obviously specific to our implementation, but one can see how similar logic could be used to implement something like Google’s Duplex demo for outbound calls too!</p>
<ul>
<li><a href="https://cogint.ai/author/alexey/">Alexey Aylarov</a>, CEO <a href="https://voximplant.com">VoxImplant</a></li>
</ul>
</div>]]></content:encoded></item></channel></rss>