The Voice Musings #1: Designing For Voice When Alexa Doesn’t Really Understand

How to design a good VUI (Voice User Interface) experience, despite current technological limitations

Eunji “Jinny” Seo
8 min read · Mar 8, 2018
The Amazon Echo, along with some blueberries; Image: Amazon

In 2015, Sabre Labs explored VUIs (Voice User Interfaces) and created the Lulu prototype, a voice-enabled assistant for travel managers that runs on the Amazon Echo. And when Amazon debuted the Echo Show last summer, we updated Lulu to explore the unfamiliar intersection of VUIs and GUIs (Graphical User Interfaces).

One of the questions we asked during these explorations was, “What makes voice interfaces unique?” A few themes emerged.

VUIs flatten navigation, with flexible entry points.

There are a myriad of answers to the question, but one answer that had us digging deeper was that VUIs “flatten cumbersome navigational structures.” Jumping straight to a question in an FAQ, applying filters through natural language, and reaching functionality that would be buried a few levels deep in traditional GUIs are a few examples.

This flattening of navigation through language is not a VUI-specific trait; a similar shift happened decades ago with web discovery and navigation. Before the invention and refinement of the query-based search bar we use today, web navigation often relied on hierarchical links that users had to click through, level by level.

From file directories to now—we’ve come a long way. One could consider the search bar a type of conversational interface.

So flattening navigation isn’t really breaking news in today’s conversational revolution. What is more interesting, however, is how conversational interfaces today are flattening navigation: through flexible entry points.

A Solution to the Discoverability Problem: Multiple Entry Points

Conversational interfaces are often opaque, providing little or no indication of an app’s functionality or architecture the way traditional GUIs do. Consequently, one of the biggest problems with conversational interfaces today is that many of them devolve into fancy command lines.

Users struggle when they are required to remember specific words or phrases to use a conversational interface. At Sabre Labs, we refer to this phenomenon as the discoverability problem.

Again, this is not a new problem. Macs have Spotlight, and Windows now has Cortana, both of which let you access programs directly via a search bar. Theoretically, this is more efficient than navigating to a program’s location and clicking the icon. But most people still access applications through navigation, and it’s not hard to see why.

When I type “internet” into Spotlight, I don’t get Safari or Chrome or Firefox. I see this:

Esc. Esc. Esc. Oh wait, I don’t have that key anymore.

Many VUIs fail for the same reasons that most users prefer not to open up the command line: People don’t remember details — they remember concepts, the big picture. And they’re definitely not going to remember the three commands required to use your app.

The discoverability problem could refer to a couple of scenarios:

  • Scenario 1. The user is unable to recall how to invoke the voice app/skill at all, similar to how you forget apps you’ve downloaded on your phone, except this time you have to remember the app’s exact name.
  • Scenario 2. The user is unable to use a conversational interface effectively or at all because there are no clear ways to discover the app’s functionalities and/or the app’s functionalities are difficult to remember.

The solution to both scenarios is to have flexible entry points, whether for launching an app or for reaching a functionality. Unfortunately, Scenario 1 is outside of our control, as the experience of invoking apps is tied to the platform. All we can do is give our apps a catchy, easy-to-say name.

I’ve mused about smart utterances for app launch—for example, if a user has only one health app installed, s/he should be able to say anything health-related and get that app. But that would require some fancy programming and possibly AI... which leads us to Scenario 2.

Ideally, Scenario 2 would be solved by an NLP (natural language processing) engine advanced enough to organically parse and understand any user request as a human would. We’re not there yet.

So for now, we have to mitigate this problem through other methods. On the Amazon Alexa platform, this is done through something called intents and utterances—with this model, we can build flexible entry points into our app even with current NLP technology.

A Primer on Intents and Utterances

For readers who are not familiar with building third-party voice apps, i.e. “skills” on Alexa, here’s a quick explanation.

Let’s say that you want to find flights that connect in a certain city. This is an intent, which I named “FlightsByConnection” as shown below. We can express this intent in many different ways.

We could say, “What flights connect in New York City?”, “Find a flight with a layover in NYC,” or even, “I need to connect in New York.” These commands, dubbed utterances by the Alexa platform, are the multiple entry points. As you can see, the more utterances there are, the easier it is to access the intent.
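
To make this concrete, here is roughly what that intent could look like. The real Alexa interaction model is JSON and the city would typically be captured as a slot; the snippet below is a Python-dict sketch for illustration, and the slot name and type are my assumptions rather than anything from the actual Lulu skill.

```python
# A sketch of the hypothetical "FlightsByConnection" intent. Alexa's
# interaction model is JSON; a Python dict is used here for illustration.
# The slot name and built-in slot type are assumptions.
flights_by_connection_intent = {
    "name": "FlightsByConnection",
    "slots": [
        {"name": "connectionCity", "type": "AMAZON.US_CITY"},
    ],
    "samples": [
        # Every sample utterance is another entry point to the same intent.
        "what flights connect in {connectionCity}",
        "find a flight with a layover in {connectionCity}",
        "i need to connect in {connectionCity}",
    ],
}
```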

This is how most conversational interfaces are structured today. The meaning of a voice input isn’t processed the way a human would process it, but this model still facilitates navigation and mitigates the discoverability problem by providing multiple, flexible entry points to an intent.

Context-keeping is the secret sauce, but it’s difficult to get right.

What’s more difficult—but incredibly powerful—is maintaining context during all this.

Imagine, in the previous scenario, that you actually need to find flights that only connect in JFK. Ideally, you should be able to follow up the previous request with something short and sweet like, “How about just JFK?” and be understood. But without context, this simple query doesn’t mean anything.
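
One common way to approximate this today is to stash what the user just asked about in the session and consult it when a bare follow-up arrives. Below is a minimal, self-contained sketch of that idea; the function names, slot names, and attributes are hypothetical, not how Lulu or Alexa actually implement it.

```python
# A minimal sketch of keeping context between turns via session attributes.
# All names here (handlers, slots, attributes) are hypothetical.

def handle_flights_by_connection(slots, session_attributes):
    """Handles 'Find a flight with a layover in NYC.'"""
    city = slots["connectionCity"]
    # Remember what we were just talking about so that a short follow-up
    # like 'How about just JFK?' has something to refer back to.
    session_attributes["lastIntent"] = "FlightsByConnection"
    session_attributes["connectionCity"] = city
    return f"Here are flights connecting in {city}."

def handle_refinement(slots, session_attributes):
    """Handles 'How about just JFK?', which is meaningless on its own."""
    if session_attributes.get("lastIntent") == "FlightsByConnection":
        session_attributes["connectionCity"] = slots["airport"]
        return f"Okay, only showing flights connecting in {slots['airport']}."
    # No stored context: the request is ambiguous, so ask for clarification.
    return "Sorry, what would you like me to search for?"

# Two turns within the same session:
attrs = {}
print(handle_flights_by_connection({"connectionCity": "New York"}, attrs))
print(handle_refinement({"airport": "JFK"}, attrs))
```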

Dealing with context isn’t easy. It brings up tough questions like, “How do you even program context?”, “When do you know that a context has switched?”, “How long should this piece of info be maintained?”, “How can we resolve pronouns?”, and other concerns that are being worked on by really smart people.

Context-keeping is not a solved problem, but it’s happening—slowly but surely. Google Home and Alexa are both contextually aware to a degree, as CNET reports:

Amazon has improved Alexa’s contextual awareness to an extent. Ask about the weather on Thursday, and Alexa will respond accurately. Say, “How about Friday” and Alexa will understand you’re still talking about the weather. Google takes contextual awareness one step further. If you ask who plays Katniss Everdeen, then ask “what else is she in?” Google will get both questions right, knowing you mean Jennifer Lawrence when you say “she.”

Context-Independent vs. Context-Dependent Commands

Beyond technological constraints, context-keeping itself comes with a host of considerations, one of which is the concept of context-independent versus context-dependent commands.

Context-independent, top-level commands are easy. “Find flights connecting in JFK” as a context-independent command should return all the flights connecting in JFK. But what if this command were context-dependent?

With context as a variable, two identical commands could mean something wildly different:

Left: an example of what context-keeping may look like; Right: illustrating how, depending on context, the same command returns different results.

What if the user first asked, “Find flights to DFW tomorrow,” and then said, “Find flights connecting in JFK”? The user should receive flights to DFW that connect in JFK. This is impossible to do without context-keeping.
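
Here is a small sketch of that difference, assuming the earlier constraints are kept as a plain dictionary of search criteria (the structure is made up for illustration):

```python
# The same command with and without stored context. The criteria dictionary
# is a made-up structure for illustration.

def flights_by_connection(connection, context=None):
    """Builds search criteria for 'Find flights connecting in JFK.'"""
    criteria = {"connection": connection}
    if context:
        # Context-dependent: keep the earlier constraints (destination, date)
        # and treat the new command as a refinement of that search.
        criteria = {**context, **criteria}
    return criteria

# Context-independent: all flights connecting in JFK.
print(flights_by_connection("JFK"))
# -> {'connection': 'JFK'}

# Context-dependent: the user first said 'Find flights to DFW tomorrow.'
prior = {"destination": "DFW", "date": "tomorrow"}
print(flights_by_connection("JFK", context=prior))
# -> {'destination': 'DFW', 'date': 'tomorrow', 'connection': 'JFK'}
```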

Unfortunately, as mentioned before, context-keeping is difficult to do. Voice experiences will be bumpy for a while until the technology matures, but for now, as designers, we have to work with these limitations to make the ride as smooth as possible, even with those bumps.

Final tips for designing VUIs within current technological limitations

  • Find ways to establish the mental model of the app for your user early on. Listing out all the functionalities your app has in the welcome message is the simplest and most common approach, but also look for spots in your app where you could surface discoverability more naturally.
  • Keep the navigation as flat as possible, and stay away from overly nested architectures. Asking users to choose from a list of menus more than once or twice in a row can be disorienting in a conversational interface. Keep the user flow linear and prune the conversational branches as much as possible. (More on that here: 11 More Best Practices for Building Chatbots — refer to #10)
  • Provide multiple entry points to a functionality. Brainstorm and observe through user testing all the different ways a functionality could be invoked. (Caveat: if you’re building skills for Alexa, take care to not have the intents overlap, as providing too many generic utterances for an intent may confuse the NLP engine.)
  • Design a conversational experience that doesn’t require context-keeping—at least not complex context. It’s best to provide users with top-level commands that have less room for ambiguity and are less error-prone.
  • If you do need to utilize context, don’t be afraid to ask clarification questions where ambiguity may occur in the user flow. Verify with the user that the context is still valid and that they still want to keep it (see the sketch after this list). Take care to strike a balance, as too many verification questions will quickly tire the user.
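
To illustrate the last tip, here is one hedged sketch of what verifying possibly stale context could look like; the timeout, attribute names, and prompt wording are all assumptions made for illustration.

```python
# A sketch of confirming possibly stale context before silently reusing it.
# The timeout value, attribute names, and prompts are assumptions.
import time

CONTEXT_TTL_SECONDS = 120  # after this long, confirm before reusing context

def refine_search(new_connection, session_attributes):
    last_search = session_attributes.get("lastSearch")
    last_time = session_attributes.get("lastSearchTime", 0)
    if not last_search:
        return "What would you like to search for?"
    if time.time() - last_time > CONTEXT_TTL_SECONDS:
        # The stored context may no longer be what the user means: verify it
        # instead of assuming, but only when it is genuinely in doubt.
        return (f"Did you still want flights to {last_search['destination']}, "
                f"now connecting in {new_connection}?")
    last_search["connection"] = new_connection
    return f"Okay, updating your search to connect in {new_connection}."
```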

Looking Forward

Despite their current shortcomings, we expect VUIs to mature and catch on rapidly. The technology is new and immature, but by preemptively addressing the technological limitations, we can still create a positive voice experience for our users.

As natural language processing, context-keeping, and machine learning advance, VUIs will be able to simplify and streamline increasingly complex interactions, opening the door for deeper and more meaningful experiences in our lives.

This article is part of the Sabre Labs team’s analysis of emerging technology trends impacting travel. Please reach out to Sabre Labs at sabrelabs@sabre.com for further inquiries or discussions on the future of technology and travel.
