##TL;DR
STOP dispatching intents to a bunch of little functions, each of which contains app logic, state management, and speech production. INSTEAD, structure your skill code to be easier to understand, test, and change by using a simple ‘plan’ data structure that binds together key functions:
```
theplan = plan(state, intent)
state = update(state, theplan)
output = speak(theplan)
```
When I learned that Amazon was opening up Alexa to developers, I was pretty excited. I’ve been waiting for Apple to enable 3rd-party development for Siri since it showed up, and there is as yet no indication that it’s coming.
Of course, getting access to the docs and resources to build the skill is not the same as actually building a skill. For that, you have to offer me a t-shirt. Thankfully, Amazon did.
For my first skill, I decided to create Film Buff. The skill would let you ask questions like “What films star both Uma Thurman and John Travolta?” Frankly, I was amazed that I didn’t see this skill already in the gallery of third-party skills. I spent 14 hours over the course of a few days putting it together, thinking, wow, this is going to be very cool. Then, while testing with the live unit, I asked Alexa, “What film stars both Uma Thurman and John Travolta?” She answered with 3 movies. In my test code I was seeing only one answer (Pulp Fiction). WTF? Turns out, this functionality is built into Alexa already. I was asking the built-in service, not my own, which would have required me to ask “Alexa, ask Film Buff what film starred…”. Doh! And this was the morning of December 18th, the stated cutoff date for getting a t-shirt.
So, back to the drawing board. I’d already invested some time in creating a skill, so, ignoring the sunk cost fallacy, I figured I should get something out of that. I decided to create a skill that would quiz users on U.S. state capital city names. Thus was born Capital Quiz, which Amazon certified and made live on 2015-12-29. I’ve made a few enhancements (the biggie - between-session storage of state on S3) and deployed v1.3 on 2016-01-12. As of 2016-01-21, the skill has had 8,000+ users and handled 300k+ utterances. And I got an Alexa Dev t-shirt.
##Pervasive Skill Architecture
Amazon’s sample programs dispatch intents into separate functions, each of which takes the service request as input and returns the speechlet response. All the Alexa tutorials and frameworks I’ve seen take this same strategy. It seems useful to do this - we have to dispatch on intent at some point - but doing it first thing spreads awareness of the form of the request and response too widely in the app.
Consider what it’s going to take to test these intent functions. Each test has to provide valid input by packaging up app state and an intent into a simulated request object. To check correct behavior, the test code then has to unpack the speechlet response and ensure that the correct changes were made to outgoing state and that the generated speech and card are correct.
Whoa. There’s too much going on there. I’m never going to get that right. And creating a skill that supports natural seeming discourse is difficult enough already.
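To make that entanglement concrete, here’s a sketch of what one of those intent functions tends to look like. The request/response shapes and names are illustrative stand-ins, not the actual Alexa SDK:

```javascript
// A hypothetical intent handler in the style the samples encourage:
// request parsing, app logic, state change, and speech production are
// all tangled in one function. (The request, session, and response
// shapes here are illustrative stand-ins, not the real Alexa SDK.)
function handleAnswerIntent(request, session, response) {
  const guess = request.intent.slots.Answer.value;        // unpack the request
  const correct = guess === session.attributes.expected;  // app logic
  if (correct) session.attributes.score += 1;             // state change
  const speech = correct ? "Right!" : "Nope, try again."; // speech production
  response.ask(speech, "What is your answer?", session.attributes);
}
```

A test for this function has to build a fake request, a fake session, and a fake response object all at once - exactly the problem described above.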
##The Planned Alternative
What we want is to separate decision, state change, and speech output functions so that they can be considered independently.
Here’s the primary flow of a skill that is designed with this goal:
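A minimal runnable sketch of that flow - the stand-in bodies here are illustrative placeholders, not the real Film Buff tables:

```javascript
// Sketch of the primary flow: decide (plan), change state (update),
// produce speech (speak). The stand-in implementations below are
// illustrative; a real skill's dispatch tables are larger.
function plan(state, intent) {
  return intent === "LaunchRequest" ? ["welcome", "prompt"] : ["help", "prompt"];
}

function update(state, theplan) {
  return state; // many plan operations change nothing at all
}

function speak(theplan) {
  return theplan.join(" ... "); // real code maps each operation to speech
}

function handleRequest(state, intent) {
  const theplan = plan(state, intent);     // decide what to do
  const newState = update(state, theplan); // apply any state changes
  const output = speak(theplan);           // generate the response speech
  return { state: newState, output };
}
```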
Things to note:
- `plan(state, intent)` is simple to reason about because it does not generate speech. Many intents yield fixed plans like `["welcome", "prompt"]` or just one of two options.
- `update(state, theplan)` is simple because each plan operation affects state in a clear way. Many operations have no effect on state!
- `speak(theplan)` isolates speech response production and inherently handles compound speech.
- `update()` and `speak()` are totally independent. In a threaded environment that could improve performance. More importantly, independent means less complex and more testable. Do I have a theme here?
Take a look at the complete working version of Film Buff here. I wrote and tested the code locally with a more current version of Node.js.
Further note on the code: I copied some of the basic blocks of code below and included excerpts of the test code. It looks like a lot is going on, but I’m loath to trim away the details. It’s like the problem with the Facebook view of the world - in posts it looks like everyone is having a great time and your own life looks messy in comparison. Similarly, instructional code often elides error handling and boilerplate and therefore looks simple and pure. In comparison your own working code looks overwrought and messy. But that’s often what real working code looks like. Not to suggest you shouldn’t apply any and all strategies to streamline your working code. Got it?
Full disclosure: I wrote Capital Quiz in ClojureScript, wherein immutable maps and vectors are the default. In my implementation, the state is an instance of Immutable.Map and plans are plain arrays, for example:
```
[["say-films", 1001, 1020, 1050], "main-prompt"]
[["say-score" game-state], "say-goodbye", "end-session"]
```
The `plan()` function looks like this. Note that for `TwoActors` intents I call out to planning sub-functions because the logic is more than a couple of lines. Even in those cases, this is the only place where slots are referenced. Could you use a more declarative system to express this dispatch table? Sure! But the important point would be to dispatch to functions that look only at state and intent and return a plan. Don’t do more!
The test code for `plan()` looks like this:
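A sketch of this shape, assuming hypothetical intent names, slot structure, and film ids (the real dispatch table is larger), with a test excerpt at the bottom:

```javascript
// Sketch of plan(): look only at state and intent, return a plan
// (an array of operations). Intent names, slot shapes, and film ids
// are illustrative assumptions.
function plan(state, intent) {
  switch (intent.name) {
    case "LaunchRequest":     return ["welcome", "main-prompt"];
    case "TwoActors":         return planTwoActors(state, intent);
    case "AMAZON.StopIntent": return ["say-goodbye", "end-session"];
    default:                  return ["help", "main-prompt"];
  }
}

// A planning sub-function: the only place slots are referenced.
function planTwoActors(state, intent) {
  const key = `${intent.slots.ActorOne.value}|${intent.slots.ActorTwo.value}`;
  const films = state.filmsByActors[key] || [];
  return films.length > 0
    ? [["say-films", ...films], "main-prompt"]
    : ["no-films-found", "main-prompt"];
}

// Test excerpt: no request objects, no speechlet responses --
// just state and intent in, plan out.
const state = { filmsByActors: { "Uma Thurman|John Travolta": [1001] } };
const theplan = plan(state, {
  name: "TwoActors",
  slots: { ActorOne: { value: "Uma Thurman" },
           ActorTwo: { value: "John Travolta" } },
});
console.assert(theplan[0][0] === "say-films" && theplan[0][1] === 1001);
```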
The `update()` function is as follows. State changes are no-ops or a few simple lines using only plan operands.
The test code for `update()` looks like the following. In, out, simple.
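A sketch of that shape, with illustrative operation names, followed by a test excerpt in the same spirit:

```javascript
// Sketch of update(): each plan operation affects state in a small,
// obvious way; most are no-ops. Operation names and state fields are
// illustrative assumptions.
function update(state, theplan) {
  return theplan.reduce((s, op) => {
    const opName = Array.isArray(op) ? op[0] : op;
    switch (opName) {
      case "end-session":
        return Object.assign({}, s, { sessionEnded: true });
      case "say-films":
        // uses only the plan operands, never the raw request
        return Object.assign({}, s, { lastFilms: op.slice(1) });
      default:
        return s; // most operations leave state untouched
    }
  }, state);
}

// Test excerpt: in, out, simple.
const next = update({ sessionEnded: false },
                    [["say-films", 1001, 1020], "main-prompt"]);
console.assert(next.lastFilms.length === 2 && next.sessionEnded === false);
```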
`speak()` (with its helper `speakOp()`) is the biggest dispatcher. Keep in mind, we’re not dispatching on intents, we’re dispatching on plan operations. Again, could you put these operations into functions and use a map dispatcher? Sure! Notice that `getMoviesSpeech()` is just such a call. But given that many operations are one-liners, it’s not entirely clear what the benefit would be. Note the critical call to `combineSpeech()` each time through the loop.
The test code for `speak()` looks like the following.
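A sketch of the whole speech side, with illustrative operation names and speech texts (`combineSpeech()`, `speakOp()`, and `getMoviesSpeech()` play the roles described above), ending with a test excerpt:

```javascript
// Sketch of speech production. speakOp() dispatches on plan operations,
// not intents; combineSpeech() accumulates the pieces each time through
// the loop. Operation names and speech texts are illustrative assumptions.
function combineSpeech(soFar, next) {
  return soFar ? `${soFar} ${next}` : next;
}

function speakOp(op) {
  const opName = Array.isArray(op) ? op[0] : op;
  switch (opName) {
    case "welcome":     return "Welcome to Film Buff.";
    case "main-prompt": return "Name two actors.";
    case "say-films":   return getMoviesSpeech(op.slice(1));
    case "say-goodbye": return "Goodbye.";
    case "end-session": return ""; // no speech; update() handles this one
    default:            return "";
  }
}

// One operation big enough to warrant its own function.
function getMoviesSpeech(filmIds) {
  return `I found ${filmIds.length} film${filmIds.length === 1 ? "" : "s"}.`;
}

function speak(theplan) {
  let speech = "";
  for (const op of theplan) {
    speech = combineSpeech(speech, speakOp(op)); // combine every pass
  }
  return speech.trim();
}

// Test excerpt: a plan in, compound speech out.
console.assert(speak(["welcome", "main-prompt"]) ===
               "Welcome to Film Buff. Name two actors.");
```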
##Conclusion
The opportunity with Alexa is to create lots of interesting skills. The potential for complexity is great. Before more skills get created, we need to consider different choices for how they’re architected. I hope the strategy described here is compelling, or at least inspires readers to consider even more alternatives.