##TL;DR
STOP dispatching intents to a bunch of little functions, each of which contains app logic, state management, and speech production. INSTEAD, structure your skill code to be easier to understand, test, and change by using a simple ‘plan’ data structure that binds together key functions:
```
theplan = plan(state, intent)
state = update(state, theplan)
output = speak(theplan)
```
When I learned that Amazon was opening up Alexa to developers, I was pretty excited. I’ve been waiting for Apple to enable 3rd-party development for Siri since it showed up, and there is as yet no indication that it’s coming.
Of course, getting access to the docs and resources to build the skill is not the same as actually building a skill. For that, you have to offer me a t-shirt. Thankfully, Amazon did.
For my first skill, I decided to create Film Buff. The skill would let you ask questions like “What films star both Uma Thurman and John Travolta?” Frankly, I was amazed that I didn’t see this skill already in the gallery of third-party skills. I spent 14 hours over the course of a few days putting it together, thinking, wow, this is going to be very cool. Then, while testing with the live unit, I asked Alexa, “What film stars both Uma Thurman and John Travolta?” She answered with 3 movies. In my test code I was seeing only one answer (Pulp Fiction). WTF? Turns out, this functionality is built into Alexa already. I was asking the built-in service, not my own, which would have required me to ask “Alexa, ask Film Buff what film starred…”. Doh! And this was the morning of December 18th, the stated cutoff date for getting a t-shirt.
So, back to the drawing board. I’d already invested some time in creating a skill, so, ignoring the sunk cost fallacy, I figured I should get something out of that. I decided to create a skill that would quiz users on U.S. state capital city names. Thus was born Capital Quiz, which Amazon certified and made live on 2015-12-29. I’ve made a few enhancements (the biggie - between-session storage of state on S3) and deployed v1.3 on 2016-01-12. As of 2016-01-21, the skill has had 8,000+ users and handled 300k+ utterances. And I got an Alexa Dev t-shirt.
##Pervasive Skill Architecture
Amazon’s sample programs dispatch intents into separate functions, each of which takes the service request as input and returns the speechlet response. All the Alexa tutorials and frameworks I’ve seen take this same strategy. It seems useful to do this - we have to dispatch on intent at some point - but doing it first thing spreads awareness of the form of the request and response too widely in the app.
Consider what it’s going to take to test these intent functions. Each test has to provide valid input by packaging up app state and an intent into a simulated request object. To check correct behavior, the test code then has to unpack the speechlet response and ensure that the correct changes were made to outgoing state and that the generated speech and card are correct.
Whoa. There’s too much going on there. I’m never going to get that right. And creating a skill that supports natural seeming discourse is difficult enough already.
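To make that entanglement concrete, here’s a sketch of what one of those intent functions tends to look like. The request/response shapes and names are illustrative stand-ins, not the actual Alexa SDK:

```javascript
// A hypothetical intent handler in the style the samples encourage:
// request parsing, app logic, state change, and speech production are
// all tangled in one function. (The request, session, and response
// shapes here are illustrative stand-ins, not the real Alexa SDK.)
function handleAnswerIntent(request, session, response) {
  const guess = request.intent.slots.Answer.value;        // unpack the request
  const correct = guess === session.attributes.expected;  // app logic
  if (correct) session.attributes.score += 1;             // state change
  const speech = correct ? "Right!" : "Nope, try again."; // speech production
  response.ask(speech, "What is your answer?", session.attributes);
}
```

A test for this function has to build a fake request, a fake session, and a fake response object all at once - exactly the problem described above.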
##The Planned Alternative
What we want is to separate decision, state change, and speech output functions so that they can be considered independently.
Here’s the primary flow of a skill that is designed with this goal:
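A minimal runnable sketch of that flow - the stand-in bodies here are illustrative placeholders, not the real Film Buff tables:

```javascript
// Sketch of the primary flow: decide (plan), change state (update),
// produce speech (speak). The stand-in implementations below are
// illustrative; a real skill's dispatch tables are larger.
function plan(state, intent) {
  return intent === "LaunchRequest" ? ["welcome", "prompt"] : ["help", "prompt"];
}

function update(state, theplan) {
  return state; // many plan operations change nothing at all
}

function speak(theplan) {
  return theplan.join(" ... "); // real code maps each operation to speech
}

function handleRequest(state, intent) {
  const theplan = plan(state, intent);     // decide what to do
  const newState = update(state, theplan); // apply any state changes
  const output = speak(theplan);           // generate the response speech
  return { state: newState, output };
}
```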
Things to note:
- `plan(state, intent)` is simple to reason about because it does not generate speech. Many intents yield fixed plans like `["welcome", "prompt"]` or just one of two options.
- `update(state, theplan)` is simple because each plan operation affects state in a clear way. Many operations have no effect on state!
- `speak(theplan)` isolates speech response production and inherently handles compound speech.
- `update()` and `speak()` are totally independent. In a threaded environment that could improve performance. More importantly, independent means less complex and more testable. Do I have a theme here?
Take a look at the complete working version of Film Buff here. I wrote and tested the code locally with a more current version of Node.js.
Further note on the code: I copied some of the basic blocks of code below and included excerpts of the test code. It looks like a lot is going on, but I’m loath to trim away the details. It’s like the problem with the Facebook view of the world - in posts it looks like everyone is having a great time and your own life looks messy in comparison. Similarly, instructional code often elides error handling and boilerplate and therefore looks simple and pure. In comparison your own working code looks overwrought and messy. But that’s often what real working code looks like. Not to suggest you shouldn’t apply any and all strategies to streamline your working code. Got it?
Full disclosure: I wrote Capital Quiz in ClojureScript, wherein immutable maps and vectors are the default. In my implementation, the state is an instance of Immutable.Map and plans are plain arrays, for example:
```
[["say-films", 1001, 1020, 1050], "main-prompt"]
[["say-score" game-state], "say-goodbye", "end-session"]
```
The `plan()` function looks like this. Note that for `TwoActors` intents I call out to planning sub-functions because the logic is more than a couple of lines. Even in those cases, this is the only place where slots are referenced. Could you use a more declarative system to express this dispatch table? Sure! But the important point would be to dispatch to functions that look only at state and intent and return a plan. Don’t do more!
The test code for `plan()` looks like this:
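A sketch of this shape, assuming hypothetical intent names, slot structure, and film ids (the real dispatch table is larger), with a test excerpt at the bottom:

```javascript
// Sketch of plan(): look only at state and intent, return a plan
// (an array of operations). Intent names, slot shapes, and film ids
// are illustrative assumptions.
function plan(state, intent) {
  switch (intent.name) {
    case "LaunchRequest":     return ["welcome", "main-prompt"];
    case "TwoActors":         return planTwoActors(state, intent);
    case "AMAZON.StopIntent": return ["say-goodbye", "end-session"];
    default:                  return ["help", "main-prompt"];
  }
}

// A planning sub-function: the only place slots are referenced.
function planTwoActors(state, intent) {
  const key = `${intent.slots.ActorOne.value}|${intent.slots.ActorTwo.value}`;
  const films = state.filmsByActors[key] || [];
  return films.length > 0
    ? [["say-films", ...films], "main-prompt"]
    : ["no-films-found", "main-prompt"];
}

// Test excerpt: no request objects, no speechlet responses --
// just state and intent in, plan out.
const state = { filmsByActors: { "Uma Thurman|John Travolta": [1001] } };
const theplan = plan(state, {
  name: "TwoActors",
  slots: { ActorOne: { value: "Uma Thurman" },
           ActorTwo: { value: "John Travolta" } },
});
console.assert(theplan[0][0] === "say-films" && theplan[0][1] === 1001);
```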
The `update()` function is as follows. State changes are no-ops or a few simple lines using only plan operands.
The test code for `update()` looks like the following. In, out, simple.
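A sketch of that shape, with illustrative operation names, followed by a test excerpt in the same spirit:

```javascript
// Sketch of update(): each plan operation affects state in a small,
// obvious way; most are no-ops. Operation names and state fields are
// illustrative assumptions.
function update(state, theplan) {
  return theplan.reduce((s, op) => {
    const opName = Array.isArray(op) ? op[0] : op;
    switch (opName) {
      case "end-session":
        return Object.assign({}, s, { sessionEnded: true });
      case "say-films":
        // uses only the plan operands, never the raw request
        return Object.assign({}, s, { lastFilms: op.slice(1) });
      default:
        return s; // most operations leave state untouched
    }
  }, state);
}

// Test excerpt: in, out, simple.
const next = update({ sessionEnded: false },
                    [["say-films", 1001, 1020], "main-prompt"]);
console.assert(next.lastFilms.length === 2 && next.sessionEnded === false);
```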
`speak()` (with its helper `speakOp()`) is the biggest dispatcher. Keep in mind, we’re not dispatching on intents, we’re dispatching on plan operations. Again, could you put these operations into functions and use a map dispatcher? Sure! Notice that `getMoviesSpeech()` is just such a call. But given that many operations are one-liners, it’s not entirely clear what the benefit would be. Note the critical call to `combineSpeech()` each time through the loop.
The test code for `speak()` looks like the following.
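A sketch of the whole speech side, with illustrative operation names and speech texts (`combineSpeech()`, `speakOp()`, and `getMoviesSpeech()` play the roles described above), ending with a test excerpt:

```javascript
// Sketch of speech production. speakOp() dispatches on plan operations,
// not intents; combineSpeech() accumulates the pieces each time through
// the loop. Operation names and speech texts are illustrative assumptions.
function combineSpeech(soFar, next) {
  return soFar ? `${soFar} ${next}` : next;
}

function speakOp(op) {
  const opName = Array.isArray(op) ? op[0] : op;
  switch (opName) {
    case "welcome":     return "Welcome to Film Buff.";
    case "main-prompt": return "Name two actors.";
    case "say-films":   return getMoviesSpeech(op.slice(1));
    case "say-goodbye": return "Goodbye.";
    case "end-session": return ""; // no speech; update() handles this one
    default:            return "";
  }
}

// One operation big enough to warrant its own function.
function getMoviesSpeech(filmIds) {
  return `I found ${filmIds.length} film${filmIds.length === 1 ? "" : "s"}.`;
}

function speak(theplan) {
  let speech = "";
  for (const op of theplan) {
    speech = combineSpeech(speech, speakOp(op)); // combine every pass
  }
  return speech.trim();
}

// Test excerpt: a plan in, compound speech out.
console.assert(speak(["welcome", "main-prompt"]) ===
               "Welcome to Film Buff. Name two actors.");
```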
##Conclusion
The opportunity with Alexa is to create lots of interesting skills. The potential for complexity is great. Before more skills get created, we need to consider different choices for how they’re architected. I hope the strategy described here is compelling, or at least inspires readers to consider even more alternatives.