After launching an Amazon Alexa Skill in June, we were eager to enable more people to manage email with voice. In this post, we’ll give a technical breakdown of our voice development process, from why we went down this path to what technology we used.
Why build in-app voice?
We’re owners and fans of both Amazon Echo and Google Home devices. (We actually give an Echo Dot to every new Astro employee, both to say “welcome” and to dogfood our own Alexa Skill.) But the reality is that you aren’t always near your voice assistant, and many people still haven’t gotten on the Echo or Google Home bandwagon. So, to enable more people in more places to manage their inbox with voice, we decided to explore the feasibility of building in-app voice.
When determining how to build in-app voice, we considered a number of options, but had three goals in mind:
- Reuse as much code and logic as possible from either our text-based assistant or our Alexa Skill
We’re a small startup and time savings like these go a long way. Given this, we considered api.ai (which we use for our text-based NLP) and Amazon services first.
- Create a smooth user experience with accurate voice recognition
We knew this would be particularly challenging. Siri, Alexa, and Google Assistant have a definite leg up on NLP due to scale. So, in trying to make an experience that felt similarly accurate, we had our work cut out for us.
- Let the server do the heavy lifting
We knew Astrobot Voice required a combination of OS-level APIs and server-side development. We decided to make sure the server was doing the heavy lifting. The benefits of the server doing most of the work include shared code for both our iOS and Android apps, and the ability to change the flow on the server without pushing updated versions of the Astro apps to the app stores.
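To illustrate what "the server does the heavy lifting" can look like, here is a minimal Python sketch of a thin-client contract: the app sends recognized text, and the server replies with what to speak and whether to reopen the microphone. The names `VoiceReply`, `handle_utterance`, and `to_wire`, along with the sample phrases, are hypothetical and not taken from Astro's actual code.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class VoiceReply:
    """What the server sends back to the iOS or Android app (hypothetical shape)."""
    speech: str         # text the app speaks using the OS text-to-speech API
    expect_reply: bool  # True if the app should start listening again afterwards


def handle_utterance(text: str) -> VoiceReply:
    """All dialog decisions live here, so the flow can change without an app release."""
    normalized = text.strip().lower()
    if not normalized:
        return VoiceReply("Sorry, I didn't catch that.", expect_reply=True)
    if "unread" in normalized:
        # Placeholder wording; a real server would look up the user's mailbox.
        return VoiceReply("Here is your unread email. Should I read the first one?",
                          expect_reply=True)
    return VoiceReply("Okay.", expect_reply=False)


def to_wire(reply: VoiceReply) -> str:
    """Serialize the reply as JSON for the mobile clients."""
    return json.dumps(asdict(reply))
```

With a contract like this, both apps only need to speak `speech` and check `expect_reply`; rewording a prompt or adding a follow-up question becomes a server-side change.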
On iOS, we used SFSpeechRecognizer for voice recognition and AVSpeechSynthesizer for spoken responses. SFSpeechRecognizer is only available on iOS 10+, so Astrobot Voice is only available on iOS 10 and 11. This could be a limiting factor for some app developers, but it works for us.
For Android, we used the standard Android speech APIs: SpeechRecognizer for voice recognition and TextToSpeech for spoken responses.
For both iOS and Android, we had the option of sending the server recorded audio or a text string. We decided to go with the latter option due to cost and time constraints.
On the server side, we used Amazon Lex. Choosing Amazon Lex over api.ai let us reuse and share a lot of the same logic we already had for Alexa. While we could have reused some of the logic for the text-based version of Astrobot, ultimately we decided we’d save more time and provide a better experience using Amazon Lex. We estimate this saved us 2-4 weeks of a single developer’s time. As we further develop Astrobot Voice and our Alexa Skill, we’ll continue to save time due to this decision.
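As a rough sketch of the server side, the Python function below forwards a recognized text string to Amazon Lex and pulls out the fields a server would typically act on. It assumes a Lex (V1) runtime client such as the one returned by `boto3.client("lex-runtime")`, whose `post_text` call accepts `botName`, `botAlias`, `userId`, and `inputText`. The bot name, alias, and the `ask_lex` helper are hypothetical; this is not Astro's actual implementation.

```python
from typing import Any, Dict


def ask_lex(lex_client: Any, user_id: str, text: str) -> Dict[str, Any]:
    """Send one utterance to Lex and return a simplified result.

    `lex_client` is expected to behave like boto3.client("lex-runtime");
    it is passed in so the function can be exercised with a stub.
    """
    response = lex_client.post_text(
        botName="AstrobotVoice",  # hypothetical bot name
        botAlias="prod",          # hypothetical alias
        userId=user_id,
        inputText=text,
    )
    return {
        "intent": response.get("intentName"),         # None if nothing matched
        "slots": response.get("slots") or {},         # values Lex extracted
        "dialog_state": response.get("dialogState"),  # e.g. "ElicitSlot", "Fulfilled"
        "message": response.get("message"),           # prompt text for the app to speak
    }
```

Passing the client in keeps the function easy to test and keeps AWS credentials and configuration out of the request-handling code.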
Here’s the flow of how these services work together to create the Astrobot Voice experience:
Advice to in-app voice developers
First, make sure your service or app lends itself to voice. Adding voice to certain apps right now likely isn’t a priority given the state of voice. Despite Siri, Alexa, and Google Assistant, voice still isn’t the default way of using apps. So, your use case needs to be a fairly obvious one to get traction and bubble to the top of your product roadmap. We saw a very clear use case for email and voice: at home while getting ready for the day, for example, or in the car.
Second, we recommend carefully considering the technology you use, and not reinventing the wheel. There are lots of services and resources to make the development process easier and to get an MVP out into the world. The flip side is making sure you have good abstraction on the server side, so you’re not locked into a particular service for intent detection. Since these services are still fairly new and, therefore, constantly evolving, you might eventually (or even quickly) need to switch services. For Astrobot (both voice- and text-based) we’ve tried LUIS, wit.ai, api.ai, and now Lex without needing to make significant changes to our server logic.
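One way to get that kind of abstraction is to have the server logic depend on a small interface rather than on any one NLP service. The Python sketch below is an illustration under assumptions, not Astro's code: `Intent`, `IntentDetector`, `KeywordDetector`, `LexStyleDetector`, and `respond` are hypothetical names, and the Lex-style adapter only assumes the backend returns a dict with `intentName` and `slots` keys.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Protocol


@dataclass
class Intent:
    """Service-neutral result that the rest of the server works with."""
    name: Optional[str]
    slots: Dict[str, str] = field(default_factory=dict)


class IntentDetector(Protocol):
    def detect(self, text: str) -> Intent: ...


class KeywordDetector:
    """Trivial stand-in backend, useful for local testing."""

    def detect(self, text: str) -> Intent:
        if "unread" in text.lower():
            return Intent("ReadUnread")
        return Intent(None)


class LexStyleDetector:
    """Adapter for a backend that returns Lex-style dicts."""

    def __init__(self, send_text: Callable[[str], dict]):
        self._send_text = send_text  # caller supplies the actual service call

    def detect(self, text: str) -> Intent:
        raw = self._send_text(text)
        # Drop slots the service left unfilled (reported as None).
        slots = {k: v for k, v in (raw.get("slots") or {}).items() if v is not None}
        return Intent(raw.get("intentName"), slots)


def respond(detector: IntentDetector, text: str) -> str:
    """Server logic sees only Intent, never a vendor-specific response."""
    intent = detector.detect(text)
    if intent.name == "ReadUnread":
        return "Reading your unread email."
    return "Sorry, I can't help with that yet."
```

Swapping services then means writing one new adapter class; `respond` and everything built on `Intent` stay unchanged.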
We’re excited to be the first email app with a voice assistant built in, and look forward to seeing other apps make progress on voice. In many cases, voice is a much faster way to retrieve and create information, and we’re eager to see what’s next.
Special thanks to co-authors Roland Schemers and San Oo