Less than two weeks ago I ran the Kranky Geek AI in RTC conference with Tsahi Levent-Levi and Chris Koehncke. Historically we covered mostly WebRTC content, but last year we decided to add some AI and Machine Learning topics. This year we focused most of the event on AI. Nearly all of this content is highly relevant to the topics covered by this blog, so below is a summary of the most relevant talks with highlights and links.
Much like our report, we segmented the AI in RTC talks into 4 main areas:
- Speech analytics – speech to text with Machine Learning analysis of the waveform and transcript
- Voicebots - automated programs that interact with users in a conversational dialog using speech as input and output like Siri, Alexa, Cortana, etc.
- Computer Vision – processing video to analyze and understand what is seen
- RTC Optimization - machine learning methods used to improve service quality or cost performance
I have details on the talks in each of these areas below.
This is a big area and we included three talks – two from services that make heavy use of speech analytics – Voicera and Dialpad - and one that provides a transcription and speech analytics API – Voicebase.
The branch of linguistics which studies non-phonemic aspects of speech, such as tone of voice, tempo, etc.; non-phonemic characteristics of communication; paralanguage.
Perhaps a better definition for paralinguistics was provided by Voicebase’s CTO, Jeff Shukis:
How something is said, distinct from what is said
In this talk, Jeff walked through Voicebase's investigation of paralinguistic data and where they ended up. In short, they capture the two most dominant frequencies on a per-word level along with a relative energy metric. They also look at the relative volume level and total time spoken for each word to inform speech rate. For end-customer applications, they roll this up into aggregate metrics across the conversation to look for meaningful changes in tone and volume. The relevance of any one of these features is really determined when all the data is fed into a machine learning model that maps the call data against customer-provided agent and call-outcome data.
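Voicebase did not share code, but the per-word features Jeff described are easy to picture. The sketch below (my own illustration, not Voicebase's implementation) extracts the two most dominant frequencies and an energy value from one word's audio segment using a plain FFT:

```python
import numpy as np

def word_features(samples, sample_rate):
    """Simple paralinguistic features for one word's audio segment:
    the two most dominant frequencies (Hz) and the mean signal power."""
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    # Indices of the two largest spectral peaks, skipping the DC bin
    top_two = np.argsort(spectrum[1:])[-2:] + 1
    dominant = sorted(freqs[top_two], reverse=True)
    energy = float(np.mean(samples ** 2))  # relative energy of the segment
    return dominant, energy

# Synthetic "word": a 200 Hz tone mixed with a weaker 450 Hz tone
sr = 8000
t = np.arange(0, 0.5, 1.0 / sr)
word = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 450 * t)
dominant, energy = word_features(word, sr)
```

A production system would run this per recognized word (using the transcript's word timings to slice the audio) and then aggregate across the whole conversation, as described above.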
Dealing with custom jargon
Every industry and business has its own custom vocabulary. Words like “WebRTC” don’t show up in a standard dictionary. Most personal and company names are also hard for speech engines. Unfortunately, these terms often end up being some of the most meaningful words for understanding what was said. Dialpad talked about how they address this with a technique they call Domain Adaptation. Etienne Manderscheid, their VP of Machine Learning, gave a step-by-step example of how to implement domain adaptation with Kaldi. Etienne agreed to do a complete cogint.ai post with a code walkthrough, so I’ll save the details for that.
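At its core, the problem starts with getting domain terms into the recognizer's lexicon at all. The toy function below (my own illustration, not Dialpad's Kaldi recipe) shows the shape of that step; the letter-by-letter "pronunciations" are placeholders for what a real grapheme-to-phoneme (G2P) model would produce:

```python
def extend_lexicon(base_lexicon, custom_terms):
    """Merge domain-specific terms into a pronunciation lexicon.
    Pronunciations here are naive letter-by-letter placeholders; a real
    Kaldi setup would generate phoneme sequences with a trained G2P model
    and then recompile the decoding graph."""
    lexicon = dict(base_lexicon)
    for term in custom_terms:
        if term.lower() not in lexicon:
            lexicon[term.lower()] = " ".join(term.lower())
    return lexicon

base = {"hello": "HH AH L OW"}
extended = extend_lexicon(base, ["WebRTC", "Dialpad"])
```

Adding lexicon entries is only half the job; the language model also needs to learn how likely those terms are in context, which is where the real domain adaptation work happens.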
As I covered here before, IVRs and contact centers are huge new opportunity areas for voicebots. Nexmo and IBM both covered this topic.
Contact Center Voicebot architectures and challenges
Brian Pulito of IBM gave more of an architectural talk, setting up some of the challenges contact centers face with traditional technologies and where voicebots and other AI tools fit. Brian also walked through some of the challenges in implementing these systems, including:
- Dealing with noise
- Handling dialects and custom vocabularies
- Voice authentication
- Handling latency
- Using SSML for more natural speech synthesis
- Slot filling
- Intent training time
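One of those items is easy to make concrete: SSML lets you annotate synthesized responses with pauses, emphasis, and interpretation hints so they sound less robotic. A hypothetical greeting, built as a plain string (the element names come from the SSML standard, though support varies by TTS engine):

```python
# Hypothetical contact center greeting marked up with SSML. The <break>,
# <say-as>, and <emphasis> elements guide the speech synthesizer toward
# more natural pacing and pronunciation.
ssml = (
    "<speak>"
    "Thanks for calling. <break time='300ms'/> "
    "Your balance is <say-as interpret-as='currency'>$42.50</say-as>. "
    "<emphasis level='moderate'>Is there anything else I can help with?</emphasis>"
    "</speak>"
)
```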
Integrating with Dialogflow
We have covered telephony integration with Dialogflow on several different platforms, including Dialogflow’s own Phone Gateway, VoxImplant, and SignalWire. Nexmo gave a walkthrough of their approach for this. Unlike VoxImplant and SignalWire, which have a native gRPC integration that sends audio over IP to Dialogflow, Nexmo walked through how to simply forward an incoming call to Dialogflow’s Phone Gateway. While less than ideal from a cost perspective (you make an extra outbound call) and a quality perspective (you add another leg through the PSTN), this is actually very easy to do. In addition, Dialogflow’s ability to add a webhook for fulfillment means you can do a better job of handling call transfers than the native Phone Gateway allows.
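The forwarding itself amounts to one Nexmo Call Control Object (NCCO) returned from your answer webhook. A minimal sketch, assuming a `DIALOGFLOW_GATEWAY_NUMBER` placeholder for whatever number Dialogflow's Phone Gateway assigns to your agent:

```python
# Sketch of the NCCO a Nexmo Voice API "answer" webhook could return to
# forward an inbound call to Dialogflow's Phone Gateway.
DIALOGFLOW_GATEWAY_NUMBER = "14155550100"  # hypothetical placeholder

def answer_ncco(from_number):
    # The "connect" action bridges the caller to a phone endpoint
    return [
        {
            "action": "connect",
            "from": from_number,
            "endpoint": [
                {"type": "phone", "number": DIALOGFLOW_GATEWAY_NUMBER}
            ],
        }
    ]

ncco = answer_ncco("14155550123")
```

Your web framework of choice would serialize this list as the JSON response to Nexmo's answer webhook request.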
We had several talks covering Computer Vision (CV). A couple of talks spanned many topics but included some interesting CV-related tidbits:
- Intel talked about some of their tools for helping with computer vision in the cloud
- Microsoft showed some cool videos showing Mixed Reality broadcasts with the HoloLens
Then we had a few that were entirely focused on CV.
Person detection inside Facebook’s Portal
Facebook recently launched a dedicated video chat device that leverages Facebook Messenger. Portal is meant to be placed in a stationary location inside the home. To make sure it captures the appropriate action, no matter where the call participants are located in the room, it utilizes computer vision to identify people and frame them properly.
Facebook is frequently in the news about privacy issues, so a Facebook camera that tracks people has been met with a lot of skepticism. We were lucky that Facebook let Eric Hwang and Arthur Alem talk about the Portal implementation at all, even if they weren’t allowed to go into deep detail on most of it. They did show examples of the CV features and discussed the challenges of running their vision algorithms on a constrained device in a latency-sensitive application. They also talked a bit about their WebRTC implementation with simulcast and how that applies to user-based video selection and calling to non-Portal Messenger clients.
We also had a few deeper Computer Vision talks from Agora.io and Houseparty. They both focused on using CV algorithms to improve media quality, so I will cover them in the next section.
In our report, we were surprised by the relative lack of ML activity for improving media quality and delivery. Talks at the AI in RTC event made me feel a lot better about this. Houseparty and Agora.io both covered improving video quality, and we also had a couple of talks from Callstats.io and RingCentral on using Machine Learning to make better sense of the metadata that accompanies calls.
Super-Resolution on Real Time Communications Video
In the day’s most technical talk, Shawn Zhong of Agora went into details on their super-resolution machine learning model. Super-resolution imaging is a set of techniques for improving how an image looks by enhancing its resolution. Your camera captures at a high resolution, but in RTC this resolution is reduced by the video encoder to match available bandwidth and processing constraints. The image quality is often then further reduced due to packet loss during transmission. Super-resolution aims to restore the video quality back to the original, before the degradation.
Shawn talked about how they use a Generative Adversarial Network (GAN) to address this problem. He spoke in detail about how GANs work and the challenges of optimizing heavy Deep Neural Networks (DNNs) for use on a typical mobile device.
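The talk did not include code, but one building block common to efficient super-resolution networks (such as ESPCN-style models) is the "pixel shuffle" or depth-to-space step, which turns learned feature channels into extra spatial resolution. A numpy sketch of just that operation, as my own illustration rather than Agora's implementation:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Depth-to-space ("pixel shuffle") upsampling: rearranges r*r
    feature channels into an r-times larger spatial grid.
    x has shape (H, W, C * r * r); output is (H*r, W*r, C)."""
    h, w, c = x.shape
    assert c % (r * r) == 0
    out_c = c // (r * r)
    x = x.reshape(h, w, r, r, out_c)
    x = x.transpose(0, 2, 1, 3, 4)  # interleave sub-pixel rows and columns
    return x.reshape(h * r, w * r, out_c)

lowres_features = np.random.rand(4, 4, 4)    # 4x4 map with 4 channels
highres = pixel_shuffle(lowres_features, 2)  # -> 8x8 map with 1 channel
```

In a full model, convolutional layers (and, in the GAN case, an adversarial loss) would learn those feature channels; the shuffle itself is just a cheap rearrangement, which is part of why it suits constrained mobile devices.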
Open Source ML to Improve Video Quality
Gustavo Garcia Barnado previously analyzed the use of Google’s ML Kit to provide smile detection on iOS for webrtcHacks. For his Kranky Geek AI in RTC talk, Gustavo shared some experiments he did using machine learning to improve video quality. This is super-resolution again. Unlike Shawn, who approached the super-resolution problem from a low-level Machine Learning perspective, Gustavo used an Artifacts Reduction Convolutional Neural Network (AR-CNN) model he found for Tensorflow to see what would happen. When it worked, he migrated it to CoreML for Apple devices.
If you are a Machine Learning expert like Shawn, you can build an optimized end-to-end model. But if you aren’t an ML PhD, Gustavo shows how you can still get great results leveraging open source libraries.
Analyzing Call Statistics with Unsupervised Learning
For WebRTC deployments, the RTCStats API is a great way to collect a lot of data. However, in many ways it has become too good and it is easy to get lost in all the data collected. Varun Singh of Callstats showed how they used Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) to cluster users with similar issues together. Varun walked through a real-world example where they were able to identify some specific ISPs that were causing issues for users across different WebRTC apps.
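The dimensionality-reduction step is straightforward to sketch. Below is a toy PCA over made-up per-call metrics (my own example, not Callstats data); a real pipeline would then feed the reduced data to something like scikit-learn's `TSNE` for a 2D visualization of the clusters:

```python
import numpy as np

def pca_reduce(X, n_components=2):
    """Project rows of X (one row per call, one column per RTCStats
    metric) onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)  # center each metric
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Toy data: 6 calls x 3 metrics (RTT ms, jitter ms, packet loss %)
stats = np.array([
    [40, 5, 0.1], [42, 6, 0.2], [41, 5, 0.1],        # healthy calls
    [250, 40, 8.0], [260, 45, 9.0], [255, 42, 8.5],  # a struggling ISP?
], dtype=float)
coords = pca_reduce(stats)
```

Even this toy example separates the two groups along the first component; with thousands of calls and dozens of metrics, clusters like "users behind one problematic ISP" fall out the same way.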
Using ML to normalize call quality data
In VoIP systems, call quality is generally measured by end-user and intermediary devices using a metric known as Mean Opinion Score (MOS). RingCentral noticed all of these devices measured MOS slightly differently. This created a lot of operational headaches when trying to identify and troubleshoot call quality issues. Curtis Lee Peterson of RingCentral’s operations team talked through how they took more than a million data records and ran them through a Tensorflow model to provide normalized, accurate MOS data.
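To see the shape of the problem, here is a deliberately minimal linear version of the same idea with made-up numbers: learn a per-device scale and offset that maps one device's raw MOS onto a trusted reference. RingCentral's actual model was a Tensorflow network trained on over a million records, not this least-squares toy:

```python
import numpy as np

# Hypothetical readings: this device reports MOS on a slightly shifted
# scale relative to a trusted reference measurement of the same calls.
device_mos = np.array([3.0, 3.5, 4.0, 4.5])     # device's raw MOS
reference_mos = np.array([3.2, 3.7, 4.2, 4.7])  # trusted reference MOS

# Fit reference = scale * device + offset by least squares
A = np.vstack([device_mos, np.ones_like(device_mos)]).T
scale, offset = np.linalg.lstsq(A, reference_mos, rcond=None)[0]

def normalize(mos):
    """Map this device's raw MOS onto the common reference scale."""
    return scale * mos + offset
```

A neural network earns its keep when the mapping is nonlinear and depends on extra context (codec, device type, network conditions), but the goal is the same: every device's score lands on one comparable scale.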
See the complete Kranky Geek AI in RTC playlist here. The list includes a few additional WebRTC-oriented talks not included above, including Google's annual talk and one from Discord describing how they adapted the WebRTC stack to handle 2.8 million concurrent callers in their gamer chat app.
Make sure to subscribe for future videos.
Our next Kranky Geek event in San Francisco is scheduled for November 15, 2019, so mark your calendars now!
About the Author
Chad Hart is an analyst and consultant with cwh.consulting, a product management, marketing, and strategy advisory helping to advance the communications industry. He recently co-authored a study on AI in RTC - check it out at [krankygeek.com/research](https://krankygeek.com/research).