AirPods, Alexa, and the Rise of Voice


I was in the middle of writing another post for this blog when I noticed that there was a lot of chatter going on regarding Apple’s newly released AirPods. Some have called them “magical” and the level of technology that has gone into them definitely seems to be vintage Apple. I figured I’d take a break from econ and sprinkle in some tech for my readers.

AirPods aren’t just run-of-the-mill Bluetooth earbuds; Apple appears to be treating them as a new interaction model. They aren’t a passive peripheral but a new human-computer interface for the iPhone. In conjunction with Watch, Apple seems to be contemplating usage beyond the smartphone screen. In this new computing paradigm, voice, and thus Siri, is finally set to become the primary user interface.

Though Siri has been around for a while, it is Amazon’s Alexa that has shown the tech industry how to use voice effectively as a user interface. Before the Amazon Echo, companies largely focused on shoehorning voice into pre-existing user interaction models, such as controlling elements on a computer with a display. The Echo showed the industry how to build a computer from the ground up using voice as its primary user interface (UI). It has no screen and no tactile controls that require constant interaction. The device is mainly speakers and microphones. It’s the dream of Star Trek nerds everywhere because it operates in much the same way the Enterprise computer did: you instruct it by talking to it. I’ve gotten a chance to be around an Echo and I personally think it is pretty amazing. If I could reconcile myself to having what amounts to a dedicated listening device in my home, I’d definitely own one, but between my computers and my phone, I think the world’s intelligence organizations have enough to work with.

The big difference between voice interaction for computers today and what was available in the past is that two things have improved substantially: context and linearity. By minimizing or removing the graphical interface layer entirely, voice can be used naturally and conversationally. But how is it done?

What I suspect is that the industry has rediscovered the power of verbs. When I created my UI model, Simplicity, I focused not on applications but on actions. For more background, check out my initial idea document, which grew into a concept document for a new type of tablet. (Coincidentally, a very similar product is being produced and will become available next year; I’ll share my mostly positive thoughts on it at a later date.)


As I noted in my document, it is very easy to encapsulate a very wide range of functionality with a few verbs. An additional benefit is that a verb-based interaction model is also highly self-organizing. What I didn’t anticipate was that such an interface would also be ideal for voice. However, considering the applicable verbs for a device that requires no tactile control and has no screen, a remarkably simple yet intuitive voice interface is conceivable.

Consider how Amazon’s virtual assistant, Alexa, is currently used. One class of interaction would be actions. In other words, one of the main things that can be done with Alexa is to have it perform an action, for instance: “Alexa, play Pachelbel’s Canon in D.” In this interaction model, the verb “play” is the cue; at least up until this point, the programming model is pretty simple. It gets more complicated when nouns are added to the equation but the overall model at least has the benefit of being very linear and pretty intuitive from the programming perspective, at least in theory (Alexa programmers may seriously disagree but I’m stating that conceptually).
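To make the verb-as-cue idea concrete, here is a minimal Python sketch of an action dispatcher. It is purely illustrative and is not how Alexa is actually programmed: the wake word handling, the `ACTIONS` table, and the `play`, `stop`, and `handle_utterance` functions are all hypothetical names invented for this example. It assumes the speech has already been turned into text and that the verb is always the first word after the wake word.

```python
# Hypothetical handlers: each verb maps to one function that receives
# whatever follows the verb (the "noun" part of the request).
def play(target: str) -> str:
    return f"Playing {target}"


def stop(target: str) -> str:
    return "Stopping playback"


# The verb table is the whole "programming model" in this sketch.
ACTIONS = {"play": play, "stop": stop}


def handle_utterance(utterance: str, wake_word: str = "alexa"):
    """Dispatch on the verb that follows the wake word.

    Returns None when the utterance is not addressed to the assistant.
    """
    words = utterance.strip().rstrip(".!?").split()
    # The first word must be the wake word (ignoring a trailing comma).
    if len(words) < 2 or words[0].strip(",").lower() != wake_word:
        return None
    verb = words[1].lower()
    noun_phrase = " ".join(words[2:])
    handler = ACTIONS.get(verb)
    if handler is None:
        return f"Sorry, I don't know how to {verb}."
    return handler(noun_phrase)


print(handle_utterance("Alexa, play Canon in D."))
```

The verb lookup is a single dictionary access, which is the simple part. Everything hard lives inside the handlers, where the noun phrase ("Canon in D") has to be resolved to something real.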

The second class of interactions would be interrogative in nature, such as “who,” “what,” “where,” “when,” and “how.” “Why” would form a special case because this particular interrogative is difficult even in human-to-human interaction. However, even in this case, the interrogative acts as a very simple cue. Nouns definitely complicate this process more than an action-based model and I suspect that it is with these types of inquiries that most voice assistants struggle.
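The interrogative class can be sketched the same way, with the question word as the cue. Again, this is only a conceptual illustration in Python under my own assumptions: `classify_question` and its return labels are hypothetical, the input is assumed to be already-transcribed text, and the question word is assumed to come first.

```python
# Question words treated as ordinary cues in this sketch.
INTERROGATIVES = {"who", "what", "where", "when", "how"}


def classify_question(utterance: str):
    """Return (kind, cue, remainder) based on the leading word.

    kind is "question" for the ordinary interrogatives,
    "special_case" for "why", and "not_a_question" otherwise.
    """
    words = utterance.lower().rstrip("?.! ").split()
    if not words:
        return ("not_a_question", "", "")
    cue, remainder = words[0], " ".join(words[1:])
    if cue == "why":
        # "Why" is flagged separately because it is the hard one.
        return ("special_case", cue, remainder)
    if cue in INTERROGATIVES:
        return ("question", cue, remainder)
    return ("not_a_question", cue, remainder)


print(classify_question("Where is the nearest coffee shop?"))
```

Spotting the cue is trivial. The remainder ("is the nearest coffee shop") is where the nouns live, and making sense of it is the part this sketch deliberately leaves out.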

What I see is that, at least at the conceptual level, voice interfaces have the potential for a very high level of power and sophistication. Even more importantly, their programming interfaces have the potential to be extremely powerful yet very simple.

It took a minute, but it looks like voice interaction may finally be ready for prime time. In my opinion, Alexa was the first really great example of voice as an interface done right, but I wouldn’t count out what Apple is doing with Siri. The AirPods are a really innovative way of making that interaction model more mainstream. I suspect that augmented reality (AR) will be the third pillar of Apple’s attempt to redefine human-computer interaction, likely in the form of an eye wearable. My guess is that Apple’s main challenge on that front is making such a device look unobtrusive enough to wear all the time.

And the iPhone? The brains of the outfit, at least until Watch is sufficiently powerful and the new interface and usage models are fully perfected.

